Building Synthetic Voices 



by Alan W Black and Kevin A. Lenzo

For FestVox 2.1 Edition 

Copyright © 1999-2007 Alan W Black & Kevin A. Lenzo 



Permission to use, copy, modify and distribute this document for any purpose and without fee is hereby granted in 
perpetuity, provided that the above copyright notice and this paragraph appear in all copies. 




Table of Contents 



I. Speech Synthesis
    1. Overview of Speech Synthesis
        History
        Uses of Speech Synthesis
        General Anatomy of a Synthesizer
    2. Speech Science
    3. A Practical Speech Synthesis System
        Basic Use
        Utterance structure
        Modules
        Utterance access
        Utterance building
        Extracting features from utterances
II. Building Synthetic Voices
    4. Basic Requirements
        Hardware/software requirements
        Voice in a new language
        Voice in an existing language
        Selecting a speaker
        Who owns a voice
        Recording under Unix
        Extracting pitchmarks from waveforms
    5. Limited domain synthesis
        designing the prompts
        customizing the synthesizer front end
        autolabeling issues
        unit size and type
        using limited domain synthesizers
        Telling the time
        Making it better
    6. Text analysis
        Non-standard words analysis
        Token to word rules
        Number pronunciation
        Homograph disambiguation
        TTS modes
        Mark-up modes
    7. Lexicons
        Word pronunciations
        Lexicons and addenda
        Out of vocabulary words
        Building letter-to-sound rules by hand
        Building letter-to-sound rules automatically
        Post-lexical rules
        Building lexicons for new languages
    8. Building prosodic models
        Phrasing
        Accent/Boundary Assignment
        F0 Generation
        Duration
        Prosody Research
        Prosody Walkthrough
    9. Corpus development
        Non-Latin-script languages
    10. Waveform Synthesis
    11. Diphone databases
        Diphone introduction
        Defining a diphone list
        Recording the diphones
        Labeling the diphones
        Extracting the pitchmarks
        Building LPC parameters
        Defining a diphone voice
        Checking and correcting diphones
        Diphone check list
    12. Unit selection databases
        Cluster unit selection
        Building a Unit Selection Cluster Voice
        Diphones from general databases
    13. Statistical Parametric Synthesis
        Building a CLUSTERGEN Statistical Parametric Synthesizer
    14. Labeling Speech
        Labeling with Dynamic Time Warping
        Labeling with Full Acoustic Models
        Prosodic Labeling
    15. Evaluation and Improvements
        Evaluation
        Does it work at all?
        Formal Evaluation Tests
        Debugging voices
III. Interfacing and Integration
    16. Markup
    17. Concept-to-speech
    18. Deployment
IV. Recipes
    19. A Japanese Diphone Voice
    20. US/UK English Diphone Synthesizer
    21. ldom full example
    22. Non-english ldom example
V. Concluding Remarks
    23. Concluding remarks and future
    24. Festival Details
    25. Festival's Scheme Programming Language
        Overview
        Data Types
        Functions
        Core functions
        List functions
        Arithmetic functions
        I/O functions
        String functions
        System functions
        Utterance Functions
        Synthesis Functions
        Debugging and Help
        Adding new C++ functions to Scheme
        Regular Expressions
        Some Examples
    26. Edinburgh Speech Tools
    27. Machine Learning
    28. Resources
        Festival resources
        General speech resources
    29. Tools Installation
    30. English phone lists
        US phoneset
        UK phoneset



Chapter 1. Overview of Speech Synthesis



History 



* AWB: probably way too biased as a history 



The idea that a machine could generate speech has been with us for some time, but
the realization of such machines has only really been practical within the last 50 years.
Even more recently, it's in the last 20 years or so that we've seen practical examples
of text-to-speech systems that can say any text they're given, though it might be
"wrong."

The creation of synthetic speech covers a whole range of processes, and though often
they are all lumped under the general term text-to-speech, a good deal of work
has gone into generating speech from sequences of speech sounds; this would be a
speech-sound (phoneme) to audio waveform synthesis, rather than going all the way
from text to phonemes (speech sounds), and then to sound.

One of the first practical applications of speech synthesis was in 1936 when the U.K.
Telephone Company introduced a speaking clock. It used optical storage for the
phrases, words, and part-words ("noun," "verb," and so on) which were appropriately
concatenated to form complete sentences.

Also around that time, Homer Dudley developed a mechanical device at Bell Laboratories
that operated through the movement of pedals and mechanical keys, like
an organ. With a trained operator, it could be made to create sounds that, if given
a good set-up, almost sounded like speech. Called the Voder, it was demonstrated
at the 1939 World's Fair in New York and San Francisco. A recording of this device
exists, and can be heard as part of a collection of historical synthesis examples that
were distributed on a record as part of [klatt87].

The realization that the speech signal could be decomposed as a source-and-filter
model, with the glottis acting as a sound source and the oral tract being a filter, was
used to build analog electronic devices that could be used to mimic human speech.
The vocoder, also developed by Homer Dudley, is one such example. Much of the
work in synthesis in the 40s and 50s was primarily concerned with constructing replicas
of the signal itself rather than generating phones from an abstract form like
text.

Further decomposition of the speech signal allowed the development of formant synthesis,
where collections of signals were composed to form recognizable speech. The
prediction of parameters that compactly represent the signal, without the loss of any
information critical for reconstruction, has always been, and still is, difficult. Early
versions of formant synthesis allowed these to be specified by hand, with automatic
modeling as a goal. Today, formant synthesizers can produce high quality, recognizable
speech if the parameters are properly adjusted, and these systems can work very
well for some applications. It's still hard to get fully natural sounding speech from
these when the process is fully automatic, as it is from all synthesis methods.



With the rise of digital representations of speech, digital signal processing, and the
proliferation of cheap, general-purpose computer hardware, more work was done
in concatenation of natural recorded speech. Diphones appeared; that is, two adjacent
half-phones (context-dependent phoneme realizations), cut in the middle, joined into
one unit. The justification was that phone boundaries are much more dynamic than
stable, interior parts of phones, and therefore mid-phone is a better place to concatenate
units, as the stable points have, by definition, little rapid change, whereas there
are rapid changes at the boundaries that depend upon the previous or next unit.
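
As a rough illustration of the diphone idea, the following Python sketch lists the
diphones needed to span a sequence of phones. The phone symbols, the "pau" silence
padding and the function name are assumptions made for this example; they are not
taken from Festival's phone sets or API.

```python
def diphones(phones, silence="pau"):
    """Return the diphone names that span a phone sequence.

    Each diphone runs from the middle of one phone to the middle of the
    next, so n phones padded with silence at both ends give n + 1 diphones.
    """
    padded = [silence] + list(phones) + [silence]
    # Pair every phone with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# "hello" written in a hypothetical phone set
print(diphones(["h", "ax", "l", "ow"]))
```

Every join in the output falls in the middle of a phone, which is the stable region
the paragraph above argues for.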

The rise of concatenative synthesis began in the 70s, and has largely become practical
as large-scale electronic storage has become cheap and robust. When a megabyte
of memory was a significant part of a researcher's salary, less resource-intensive
techniques were worth their... weight in saved cycles in gold, to use an odd metaphor. Of
course, formant synthesis can still require significant computational power, even if it
requires less storage; the 80s speech synthesis relied on specialized hardware to deal
with the constraints of the time.

In 1972, the standard Unix manual (3rd edition) included commands to process text
to speech, from text analysis, prosodic prediction, and phoneme generation to waveform
synthesis through a specialized piece of hardware. Of course Unix had only
about 16 installations at the time and most, perhaps even all, were located in Bell
Labs at Murray Hill.

Techniques were developed to compress (code) speech in a way that it could be more
easily used in applications. The Texas Instruments Speak 'n Spell toy, released in the
late 70s, was one of the early examples of mass production of speech synthesis. The
quality was poor, by modern standards, but for the time it was very impressive.
Speech was basically encoded using LPC (Linear Predictive Coding) and mostly used
isolated words and letters, though there were also a few phrases formed by concatenation.
Simple text-to-speech (TTS) engines based on specialised chips became popular
on home computers such as the BBC Micro in the UK and the Apple ][.

Dennis Klatt's MITalk synthesizer [allen87] in many senses defined the perception
of automatic speech synthesis to the world at large. Later developed into the product
DECTalk, it produces somewhat robotic, but very understandable, speech. It is a
formant synthesizer, reflecting the state of the art at the time.

Before 1980, research in speech synthesis was limited to the large laboratories that
could afford to invest the time and money for hardware. By the mid-80s, more labs
and universities started to join in as the cost of the hardware dropped. By the late
eighties, purely software synthesizers became feasible; the speech quality was still
decidedly inhuman (and largely still is), but it could be generated in near real-time.

Of course, with faster machines and large disk space, people began to look to improving
synthesis by using larger and more varied inventories for concatenative speech.
Yoshinori Sagisaka at Advanced Telecommunications Research (ATR) in Japan developed
nuu-talk [nuutalk92] in the late 80s and early 90s. It introduced a much
larger inventory of concatenative units; thus, instead of one example of each diphone
unit, there could be many, and an automatic, acoustically based distance function
was used to find the best selection of sub-word units from a fairly broad database of
general speech. This work was done in Japanese, which has a much simpler phonetic
structure than English, making it possible to get high quality with a relatively small
database. Even up through 1994, the time needed to generate the parameter files
for a new voice in nuu-talk (503 sentences) was on the order of several days of CPU
time, and synthesis was not generally possible in real time.

With the demonstration of general unit selection synthesis in English in Rob Donovan's
PhD work [donovan95], and ATR's CHATR system ([campbell96] and [hunt96]), by
the end of the 90's, unit selection had become a hot topic in speech synthesis research.
However, despite examples of it working excellently, generalized unit selection is
known for producing very bad quality synthesis from time to time. As the optimal
search and selection algorithms used are not 100% reliable, both high and low quality
synthesis is produced, and many difficulties still exist in turning general corpora
into high-quality synthesizers as of this writing.

Of course, the development of speech synthesis is not isolated from other developments
in speech technology. Speech recognition, which has also benefited from the
reduction in cost of computational power and increased availability of general computing
into the populace, informs the work on speech synthesis, and vice versa.
There are now many more people who have the computational resources and interest
in running speech applications, and this ability to run such applications puts the demand
on the technology to deliver both working recognition and acceptable quality
speech synthesis.

The availability of free and semi-free synthesis systems, such as the Festival Speech
Synthesis System and the MBROLA project, makes the cost of entering the field of
speech synthesis much lower, and many more groups have now joined in the
development.




However, although we are now at the stage where talking computers are with us, there
is still a great deal of work to be done. We can now build synthesizers of (probably)
any language that can produce recognizable speech, with a sufficient amount of work;
but if we are to use speech to receive information as easily when we're talking with
computers as we do in everyday conversation, synthesized speech must be natural,
controllable and efficient (both in rendering and in the building of new voices).



Uses of Speech Synthesis

While speech and language were already important parts of daily life before the invention
of the computer, the equipment and technology that has developed over the
last several years has made it possible to have machines that speak, read, or even
carry out dialogs. A number of vendors provide both recognition and speech technology,
and there are several telephone-based systems that do interesting things.

General Anatomy of a Synthesizer 

[ diagram: text going in and moving around, coming out audio ]

Within Festival we can identify three basic parts of the TTS process:

Text analysis:

From raw text to identified words and basic utterances.

Linguistic analysis:

Finding pronunciations of the words and assigning prosodic structure to them:
phrasing, intonation and durations.

Waveform generation:

From a fully specified form (pronunciation and prosody) generate a waveform.

These partitions are not absolute, but they are a good way of chunking the problem.
Of course, different waveform generation techniques may need different types of
information. Pronunciation may not always use standard phones, and intonation need
not necessarily mean an F0 contour. For the main part, along at least the path which is
likely to generate a working voice, rather than the more research oriented techniques
described, the above three sections will be fairly cleanly adhered to.

There is another part to TTS which is normally not mentioned; we will mention it
here as it is the most important aspect of Festival that makes building of new voices
possible: the system architecture. Festival provides a basic utterance structure, a
language to manipulate it, and methods for construction and deletion; it also interacts
with your audio system in an efficient way, spooling audio files while the rest of the
synthesis process can continue. With the Edinburgh Speech Tools, it offers basic analysis
tools (pitch trackers, classification and regression tree builders, waveform I/O
etc.) and a simple but powerful scripting language. All of these functions make it so
that you may get on with the task of building a voice, rather than worrying about the
underlying software too much.






Text 

We try to model the voice independently of the meaning, with machine learning
techniques and statistical methods. This is an important abstraction, as it moves us
from the realm of "all human thought" to "all possible sequences." Rather than asking
"when and why should this be said," we ask "how is this performed, as a series
of speech sounds?" In general, we'll discuss this under the heading of text analysis:
going from written text, possibly with some mark-up, to a set of words and their
relationships in an internal representation, called an utterance structure.

Text analysis is the task of identifying the words in the text. By words, we mean tokens
for which there is a well defined method of finding their pronunciation, i.e. from a
lexicon, or using letter-to-sound rules. The first task in text analysis is to make chunks
out of the input text: tokenizing it. In Festival, at this stage, we also chunk the text
into more reasonably sized utterances. An utterance structure is used to hold the
information for what might most simply be described as a sentence. We use the term
loosely, as it need not be anything syntactic in the traditional linguistic sense, though
it most often has prosodic boundaries or edge effects. Separating a text into utterances
is important, as it allows synthesis to work bit by bit, allowing the waveform of the
first utterance to be available more quickly than if the whole file was processed as
one. Otherwise, one would simply play an entire recorded utterance, which is not
nearly as flexible, and in some domains is even impossible.

Utterance chunking is an externally specifiable part of Festival, as it may vary from
language to language. For many languages, tokens are white-space separated and utterances
can, to a first approximation, be separated after full stops (periods), question
marks, or exclamation points. Further complications, such as abbreviations, other-end
punctuation (as the upside-down question mark in Spanish), blank lines and so
on, make the definition harder. For languages such as Japanese and Chinese, where
white space is not normally used to separate what we would term words, a different
strategy must be used, though both these languages still use punctuation that can
be used to identify utterance boundaries, and word segmentation can be a second
process.
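
The first approximation described above can be sketched in a few lines of Python.
This is only an illustration of the idea and not how Festival itself chunks text:
the function name and the tiny abbreviation list are assumptions made for the
example, and cases such as closing quotes after the punctuation are not handled.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc."}


def chunk_utterances(text):
    """Split text into utterances after '.', '?' or '!', and at blank lines."""
    utterances = []
    for paragraph in re.split(r"\n\s*\n", text):
        current = []
        # Tokens are assumed to be white-space separated.
        for token in paragraph.split():
            current.append(token)
            ends_sentence = token[-1] in ".?!"
            if ends_sentence and token.lower() not in ABBREVIATIONS:
                utterances.append(" ".join(current))
                current = []
        if current:  # text left over with no final punctuation
            utterances.append(" ".join(current))
    return utterances


for utterance in chunk_utterances("Dr. Smith arrived. Did he stay? Yes!"):
    print(utterance)
```

Because each utterance is produced as soon as its end is seen, a synthesizer built
this way could start work on the first utterance before the rest of the text has
been read.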

Apart from chunking, text analysis also does text normalization. There are many tokens
which appear in text that do not have a direct relationship to their pronunciation.
Numbers are perhaps the most obvious example. Consider the following sentence:

On May 5 1996, the university bought 1996 computers.

In English, tokens consisting solely of digits have a number of different forms of
pronunciation. The "5" above is pronounced "fifth", an ordinal, because it is the day
in a month. The first "1996" is pronounced as "nineteen ninety six" because it is a year,
and the second "1996" is pronounced as "one thousand nine hundred and ninety six"
(British English) as it is a quantity.




Two problems turn up here: the non-trivial relationship of tokens to words, and
homographs, where the same token may have alternate pronunciations in different contexts.
In Festival, homograph disambiguation is considered as part of text analysis.
In addition to numbers, there are many other symbols which have internal structure
that requires special processing, such as money, times, addresses, etc. All of these
can be dealt with in Festival by what are termed token-to-word rules. These are language
specific (and sometimes text mode specific). Detailed examples will be given
in the text analysis chapter below.
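
To make the digit example concrete, here is a small Python sketch of a context
sensitive rule for digit tokens. It is not Festival's token-to-word mechanism (those
rules are written in Scheme and are covered in the text analysis chapter); every
name below is hypothetical, the context tests are deliberately crude, and the tokens
are assumed to have had their punctuation stripped already.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def cardinal(n):
    """Spell out 0 <= n < 10000 as a British English quantity."""
    if n < 20:
        return [ONES[n]]
    if n < 100:
        return [TENS[n // 10]] + ([ONES[n % 10]] if n % 10 else [])
    if n < 1000:
        rest = ["and"] + cardinal(n % 100) if n % 100 else []
        return [ONES[n // 100], "hundred"] + rest
    words, rest = [ONES[n // 1000], "thousand"], n % 1000
    if rest >= 100:
        words += cardinal(rest)
    elif rest:
        words += ["and"] + cardinal(rest)
    return words


def ordinal(n):
    """Spell out n as an ordinal by changing the last word of the cardinal."""
    words = cardinal(n)
    last = words[-1]
    if last in IRREGULAR_ORDINALS:
        words[-1] = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        words[-1] = last[:-1] + "ieth"
    else:
        words[-1] = last + "th"
    return words


def year(n):
    """Read a four-digit year as two pairs, e.g. 1996 as nineteen ninety six."""
    if n % 100 < 10:  # years such as 1905 or 2000 need rules not sketched here
        return cardinal(n)
    return cardinal(n // 100) + cardinal(n % 100)


def token_to_words(tokens):
    """Expand digit tokens according to their left context."""
    words = []
    for i, token in enumerate(tokens):
        if not token.isdecimal() or int(token) >= 10000:
            words.append(token)  # not handled by this sketch
            continue
        n = int(token)
        prev = tokens[i - 1].lower() if i > 0 else ""
        before_prev = tokens[i - 2].lower() if i > 1 else ""
        if prev in MONTHS and 1 <= n <= 31:
            words += ordinal(n)  # day of the month
        elif len(token) == 4 and prev.isdecimal() and before_prev in MONTHS:
            words += year(n)  # year following "month day"
        else:
            words += cardinal(n)  # plain quantity
    return words


sentence = "On May 5 1996 the university bought 1996 computers"
print(" ".join(token_to_words(sentence.split())))
```

The same token "1996" is expanded in two different ways depending on what precedes
it, which is exactly the non-trivial token-to-word relationship described above.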



Lexicons 

After we have a set of words to be spoken, we have to decide what the sounds should
be: what phonemes, or basic speech sounds, are spoken. Each language and dialect
has a phoneme set associated with it, and the choice of this inventory is still not
agreed upon; different theories posit different feature geometries. Given a set of units,
we can, once again, train models from them, but it is up to linguistics (and practice)
to help us find good levels of structure and the units at each.



Prosody 

Prosody, or the way things are spoken, is an extremely important part of the speech
message. Changing the placement of emphasis in a sentence can change the meaning
of a word, and this emphasis might be revealed as a change in pitch, volume, voice
quality, or timing.

We'll present two approaches to taming the prosodic beast: limiting the domain to
be spoken, and intonation modeling. By limiting the domain, we can collect enough
data to cover the whole output. For some things, like weather or stock quotes, very
high quality can be produced, since these are rather contained. For general synthesis,
however, we need to be able to turn any text, or perhaps concept, into a spoken form,
and we can never collect all the sentences anyone could ever say. To handle this, we
break the prosody into a set of features, which we predict using statistically trained
models.

- phrasing
- duration
- intonation
- energy
- voice quality

Waveform generation 

For the case of concatenative synthesis, we actually collect recordings of voice talent,
and this captures the voice quality to some degree. This way, we avoid detailed
physical simulation of the oral tract, and perform synthesis by integrating pieces that
we have in our inventory; as we don't have to produce the precisely controlled articulatory
motion, we can model the speech using the units available in the sound
alone, though these are the surface realisation of an underlying, physically generated
signal, and knowledge of that system informs what we do. During waveform
generation, the system assembles the units into an audio file or stream, and that can
be finally "spoken." There can be some distortion as these units are joined together,
but the results can also be quite good.

We systematically collect the units, in all variations, so as to be able to reproduce
them later as needed. To do this, we design a set of utterances that contain all of the
variation that produces meaningful or apparent contrast in the language, and record
it. Of course, this requires a theory of how to break speech into relevant parts and
their associated features; various linguistic theories predict these for us, though none
are undisputed. There are several different possible unit inventories, and each has
tradeoffs, in terms of size, speed, and quality; we will discuss these in some detail.




Chapter 2. Speech Science 

speech, articulators, formants, phones, syllables, utterances, intonation etc.
generating speech, formant, concatenative,







Chapter 3. A Practical Speech Synthesis System 



The Festival Speech Synthesis System was developed at the Centre for Speech Technology
Research at the University of Edinburgh in the late 90's. It offers a free,
portable, language independent, run-time speech synthesis engine for various platforms
under various APIs. This book is not about the Festival system itself; Festival
is just the engine that we will use in the process of building voices, both as a run-time
engine for the voices we build and as a tool in the building process itself. This chapter
gives a background on the philosophy of the system, its basic use, and some lower
level details on its internals that will make the understanding of the whole synthesis
task easier.

The Festival Speech Synthesis System was designed to target three particular classes
of speech synthesis user.

1. Speech synthesis researchers: where they may use Festival as a vehicle for
development and testing of new research in synthesis technology.

2. Speech application developers: where synthesis is not the primary interest, but
Festival will be a substantial sub-component which may require significant
integration and hence the system must be open and easily configurable.

3. End user: where the system simply takes text and generates speech, requiring
no or very little configuration from the user.

In the design of Festival it was important that all three classes of user were served, as
there needs to be a clear route from research work to practical usable systems. This
not only encourages research to be focused but also, as has been shown by the large
uptake of the system, ensures there is a large user community interested in seeing
improvements to the system.

The Festival Speech Synthesis System was built based on the experience of previous
synthesis engines. Design of a key architecture is important as what may seem general
to begin with can quickly become a limiting factor, as new and more ambitious
techniques are attempted within it. The basic architecture of Festival benefited mainly
from previous synthesis engines developed at Edinburgh University, specifically Osprey
[taylor91]. ATR's CHATR system [black94] was also a major influence on Festival;
CHATR's original core architecture was also developed by the same authors as
Festival. In designing Festival, the intention was to avoid the previous limitations in
the utterance representation and module specification, specifically in avoiding constraints
on the types of modules and dependencies between them. However, even
with this intent, Festival went through a number of core changes before it settled.

The Festival system consists of a set of C++ objects and core methods suitable for doing
synthesis tasks. These objects include synthesis specific objects like waveforms,
tracks and utterances as well as more general objects like feature sets, n-grams, and
decision trees.

In order to give parameters and specify flow of control Festival offers a scripting language
based on the Scheme programming language [Scheme96]. Having a scripting
language is one of the key factors that makes Festival a useful system. Most of the
techniques in this book for building new voices within Festival can be done without
any changes to the core C++ objects. This makes development of new voices not only
more accessible to a larger population of users, as neither C++ knowledge nor a C++
compiler is necessary, it also makes the distribution of voices built by these techniques
easy as users do not require any recompilation to use newly created voices.

Scheme offers a very simple syntax but powerful language for specifying parameters
and simple functions. Scheme was chosen as its implementation is small and would
not increase the size of the Festival system unnecessarily. Also, using an embedded
Scheme component does not increase the requirements for installation as would the
use of say Java, Perl or Python as the scripting language. Scheme frightens some
people as Lisp based languages have an unfair reputation for being slow. Festival's
use of Scheme is (in general) limited to simple functions and very little time is spent
in the Scheme interpreter itself. Automatic garbage collection also has a reputation
for slowing systems down. In Festival, garbage collection happens after each utterance
is synthesized and again takes up only a small amount of time but allows the
programmer not to have to worry about explicitly freeing memory.

For the most part the third type of user, defined above, will never need to change any
part of the system (though they usually find something they want to change, like
adding new entries to the lexicon). The second level of user typically does most of
their customizing in Scheme, though this is usually just modifying existing pieces of
Scheme in the way that people may add simple lines of Lisp to their .emacs file. It is
primarily only the synthesis research community that has to deal with the C++ end
of the system, though C/C++ interfaces to the system as a library are also provided
(see Chapter 24 for more discussions on APIs).

This chapter covers the basic use of the system and is followed by more details of
the internal structure, particularly the utterance structure, accessing methods and
modules. These later sections are probably more detail than one needs for building
the standard voices described in the book, but this information is necessary when more
ambitious voice building tasks are attempted.



Basic Use 



The examples here are given based on a standard installation on a Unix system as
described in Chapter 29; however, the examples are likely to work under any platform
Festival supports.

The simplest way to use Festival to speak a file from the command line is by the
command

festival --tts example.txt

This will speak the text in example.txt using the default voice.
Festival can also read text from stdin using a command like

echo "Hello world" | festival --tts

Festival actually offers two modes, a text mode and a command mode. In text mode
everything given to Festival is treated as text to be spoken. In command mode everything
is treated as Scheme commands and interpreted.

When festival is started with no arguments it goes into interactive command mode.
There you may type Scheme commands and have Festival interpret them. For example

$ festival
festival>

One simple command is SayText, which takes a single string argument and says its
contents.

festival> (SayText "Hello world.")
#<Utterance 0x402a9ce8>
festival>

You may select other voices for synthesis by calling the appropriate function. For 
example 






festival> (voice_cmu_us_sls_diphone)
cmu_us_sls_diphone
festival> (SayText "Hello world.")
#<Utterance 0x402f0648>
festival>

This will use a female US English voice (if installed).

The command line interface offers command line history through the up and down
arrows (ctrl-P and ctrl-N) and editing through standard emacs-like commands.
Importantly the interface does function and filename completion too, using the TAB
key.

Any Scheme command may be typed at the command line, for example

festival> (Parameter.set 'Duration_Stretch 1.5)
festival>

This will make all durations longer for the current voice (making the voice speak slower).



festival> (SayText "a very slow example.")
#<Utterance 0x402f564376>
festival>

Calling any specific voice will reset this value (or you may do it by hand).

festival> (voice_cmu_us_kal_diphone)
cmu_us_kal_diphone
festival> (SayText "a normal example.")
#<Utterance 0x402e3348>
festival>

SayText is just a simple function that takes the given string, constructs an utterance
object from it, synthesizes it and sends the resulting waveform to the audio
device. This isn't really suitable for synthesizing anything but very short utterances.
The TTS process involves the more complex task of splitting text streams into utterances,
synthesizing them and sending them to the audio device so that they may play at
the same time as the next utterance is being worked on, so that the audio output is
continuous. Festival does this through the tts function (which is what is actually
called when Festival is given the --tts argument on the command line). In Scheme the
tts function takes two arguments, a filename and a mode. Modes are described in more detail
in the Section called TTS modes in Chapter 6, and can be used to allow special processing
of text, such as respecting markup or particular styles of text like email etc. In
the simple case the mode will be nil, which denotes the basic raw or fundamental mode.

festival> (tts "WarandPeace.txt" nil)
festival> 
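
The overlap between synthesis and playback can be pictured with a small Python
sketch. This shows the general idea only and is not Festival's implementation: the
function name speak, and the synthesize and play callables passed to it, are
stand-ins invented for the example.

```python
import queue
import threading


def speak(utterances, synthesize, play):
    """Play utterances in order while later ones are still being synthesized."""
    waveforms = queue.Queue(maxsize=2)  # small buffer between the two stages
    done = object()  # sentinel marking the end of the text

    def producer():
        try:
            for text in utterances:
                # Blocks when the buffer is full, so synthesis never runs
                # more than a couple of utterances ahead of playback.
                waveforms.put(synthesize(text))
        finally:
            waveforms.put(done)  # always let the player finish

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        wave = waveforms.get()
        if wave is done:
            break
        play(wave)
    worker.join()


# Stand-in functions: "synthesis" upper-cases the text, "playback" prints it.
speak(["first utterance.", "second utterance."], str.upper, print)
```

While play is busy with one waveform, the producer thread is already preparing the
next, which is why utterance-sized chunks keep the audio output continuous.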


Commands can also be stored in files, which is normal when a number of function
definitions and parameter settings are required. These Scheme files can be loaded by
the function load as in

festival> (load "commands.scm")
t
festival>






Arguments to Festival at startup time will normally be treated as command files and 
loaded. 



$ festival commands.scm
festival> 

However if the argument starts with a left parenthesis ( the argument is interpreted 
directly as a Scheme command. 

$ festival '(SayText "a short example.")'
festival> 

If the -b ("batch") option is specified, Festival does not go into interactive mode and
exits after processing all of the given arguments.

$ festival -b mynewvoicedefs.scm '(SayText "a short example.")'

Thus we can use Festival interactively or simply as a batch scripting language. The
batch format will be used often in the voice building process, though the interactive
mode is useful for testing new voices.
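To make the batch usage concrete, a command file might look like the following sketch. The file name say_greeting.scm and the function say_greeting are hypothetical and exist only for illustration; voice_cmu_us_kal_diphone and SayText are the functions already used in this chapter.

```scheme
;; say_greeting.scm: a hypothetical command file (the name is illustrative)

;; Select the voice used in the examples above.
(voice_cmu_us_kal_diphone)

;; A small helper defined purely for illustration; it is not part of Festival.
(define (say_greeting name)
  (SayText (string-append "Hello " name ", this is a batch test.")))

;; Evaluated when the file is loaded.
(say_greeting "world")
```

Assuming the file is in the current directory, it could be run non-interactively with festival -b say_greeting.scm, or loaded from the interactive prompt with (load "say_greeting.scm").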

Utterance structure

The basic building block for Festival is the utterance. The structure consists of a set
of relations over a set of items. Each item represents an object such as a word,
segment, syllable, etc., while relations relate these items together. An item may appear
in multiple relations; a segment, for example, will be in the Segment relation and also
in the SylStructure relation. Relations define an ordered structure over the items
within them. In general these may be arbitrary graphs, but in practice so far we have
only used lists and trees. Items may contain a number of features.

There are no built-in relations in Festival; the names and use of them are controlled
by the particular modules used to do synthesis. Language, voice and module specific
relations can easily be created and manipulated. However, within our basic voices we
have followed a number of conventions that should be followed if you wish to use some of
the existing modules.

The relation names used will depend on the particular structure chosen for your voice.
So far most of our released voices have the same basic structure, though some of our
research voices contain quite a different set of relations. For our basic English voices
the relations used are as follows

Text

Contains a single item which contains a feature with the input character string
that is being synthesized.

Token

A list of trees where each root of each tree is the white space separated
tokenized object from the input character string. Punctuation and whitespace have
been stripped and placed on features on these token items. The daughters of
each of these roots are the list of words that the token is associated with. In many
cases this is a one to one relationship, but in general it is one to zero or more. For
example tokens consisting of digits will typically be associated with a number
of words.

Word

The words in the utterance. By word we typically mean something that can be
given a pronunciation from a lexicon (or letter-to-sound rules). However in most
of our voices we distinguish pronunciation by the words and a part of speech
feature. Words will also be leaves of the Token relation, leaves of the Phrase
relation and roots of the SylStructure relation.

Phrase

A simple list of trees representing the prosodic phrasing on the utterance. In our
voices we only have one level of prosodic phrase below the utterance (though
you can easily add a deeper hierarchy if your models require it). The tree roots
are labeled with the phrase type and the leaves of these trees are in the Word
relation.

Syllable

A simple list of syllable items. These syllable items are intermediate nodes in
the SylStructure relation allowing access to the words these syllables are in
and the segments that are in these syllables. In this format no further onset/coda
distinction is made explicit but can be derived from this information.

Segment

A simple list of segment (phone) items. These form the leaves of the
SylStructure relation through which we can find where each segment is
placed within its syllable and word. By convention silence phones do not
appear in any syllable (or word) but will exist in the Segment relation.

SylStructure

A list of tree structures over the items in the Word, Syllable and Segment relations.

IntEvent

A simple list of intonation events (accents and boundaries). These are related to
syllables through the Intonation relation.

Intonation

A list of trees whose roots are items in the Syllable relation, and daughters
are in the IntEvent relation. It is assumed that a syllable may have a number
of intonation events associated with it (at least accents and boundaries), but an
intonation event may only be associated with one syllable.

Wave

A relation consisting of a single item that has a feature with the synthesized
waveform.

Target

A list of trees whose roots are segments and daughters are F0 target points. This
is only used by some intonation modules.

Unit, SourceSegments, Frames, SourceCoef, TargetCoef

A number of relations used by the UniSyn module.
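The relations listed above can be inspected directly from Scheme. The following is a minimal sketch, assuming a voice with the standard English relation set is selected; token_words is a hypothetical helper written for this illustration, while utt.relationnames, utt.relation.first, item.next, item.daughters and item.name are standard Festival access functions.

```scheme
;; Synthesize a short text so that the relations described above exist.
(set! utt1 (utt.synth (Utterance Text "It cost 15 dollars.")))

;; Names of the relations that were created for this utterance.
(utt.relationnames utt1)

;; token_words is a hypothetical helper, not part of Festival. It walks the
;; roots of the Token relation and pairs each token name with the names of
;; its daughter words.
(define (token_words utt)
  (let ((tok (utt.relation.first utt 'Token))
        (result nil))
    (while tok
      (set! result
            (cons (list (item.name tok)
                        (mapcar (lambda (w) (item.name w))
                                (item.daughters tok)))
                  result))
      (set! tok (item.next tok)))   ; next root in the Token relation
    (reverse result)))

(token_words utt1)
```

A token made of digits, such as the 15 here, would be expected to show more than one daughter word, which is the one to zero-or-more relationship described for the Token relation.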






Modules 



The basic synthesis process in Festival is viewed as applying a set of modules to an 
utterance. Each module will access various relations and items and potentially gener- 
ate new features, items and relations. Thus as the modules are applied the utterance 
structure is filled in with more and more relations until ultimately the waveform is 
generated. 

Modules may be written in C++ or Scheme. Which modules are executed is defined in terms
of the utterance type, a simple feature on the utterance itself. For most text-to-speech
cases this is defined to be of type Tokens. The function utt.synth simply looks up an
utterance's type, then looks up the definition of the synthesis process for that type
and applies the named modules. Synthesis types may be defined using the function
defUttType. For example the definition for utterances of type Tokens is



(defUttType Tokens
  (Token_POS utt)
  (Token utt)
  (POS utt)
  (Phrasify utt)
  (Word utt)
  (Pauses utt)
  (Intonation utt)
  (PostLex utt)
  (Duration utt)
  (Int_Targets utt)
  (Wave_Synth utt)
  )

A simpler case is when the input is phone names and we don't wish to do all that text
analysis and prosody prediction. Then we use the type Phones, which simply loads the
phones, applies fixed prosody and then synthesizes the waveform

(defUttType Phones
  (Initialize utt)
  (Fixed_Prosody utt)
  (Wave_Synth utt)
  )

In general the modules named in the type definitions are general and actually allow
further selection of more specific modules within them. For example the Duration module
respects the global parameter Duration_Method and will call the desired duration module
depending on this value.

When building a new voice you will probably not need to change any of these definitions,
though you may wish to add a new module, and we will show how to do that without
requiring any change to the synthesis definitions in a later chapter.

There are many modules in the system, some simply wraparounds to choose between other
modules. However the basic modules used for text-to-speech have the following basic
functions

Token_POS 

basic token identification, used for homograph disambiguation 

Token 

Apply the token to word rules building the Word relation. 

POS 

A standard part of speech tagger (if desired) 



Phrasify

Build the Phrase relation using the specified method. Various methods are offered,
from statistically trained models to simple CART trees.

Word

Lexical look up building the Syllable and Segment relations and the
SylStructure relation relating these together.

Pauses

Prediction of pauses, inserting silence into the Segment relation, again through
a choice of different prediction mechanisms.

Intonation

Prediction of accents and boundaries, building the IntEvent relation and the
Intonation relation that links IntEvents to syllables. This can easily be
parameterized for most practical intonation theories.

PostLex

Post lexicon rules that can modify segments based on their context. This is used
for things like vowel reduction, contractions, etc.

Duration

Prediction of durations of segments.

Int_Targets

The second part of intonation. This creates the Target relation representing the
desired F0 contour.

Wave_Synth

A rather general function that in turn calls the appropriate method to actually
generate the waveform.
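Because modules are ordinary functions that take an utterance, they can also be applied one at a time, which can help when checking what each one adds. The following is a sketch only. It assumes that, for input of type Text, the modules Initialize and Text are applied before the modules listed in the Tokens definition; that ordering is an assumption of this example rather than something stated above.

```scheme
;; Build an utterance but do not call utt.synth on it.
(set! utt1 (Utterance Text "A short example."))

;; Assumption: for Text input these two modules run first and create the
;; Token relation from the raw string.
(Initialize utt1)
(Text utt1)

;; Apply the first few modules named in the Tokens definition by hand.
(Token_POS utt1)
(Token utt1)
(POS utt1)
(Phrasify utt1)
(Word utt1)

;; Inspect which relations exist after these modules have run.
(utt.relationnames utt1)

;; The global parameter consulted by the Duration module.
(Parameter.get 'Duration_Method)
```

Calling utt.relationnames again after each further module should show the utterance structure being filled in step by step, as described at the start of this section.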

Utterance access

A set of simple access methods exist for utterances, relations, items and features, both
in Scheme and C++. As far as possible these access methods are kept similar in the two
languages.

As the users of this document will primarily be accessing utterances via Scheme we will
describe the basic Scheme functions available for access and give some examples of
idioms to achieve various standard functions.

In general the required arguments to a lisp function are reflected in the first parts of
the name of the function. Thus item.relation.next requires an item and a relation name,
and will return the next item in that named relation from the given one.

A listing and short description of the major utterance access and manipulation functions
is given in the Festival manual.



An important notion to be aware of is that an item is always viewed through some
particular relation. For example, assuming a typical utterance called utt1.

(set! seg1 (utt.relation.first utt1 'Segment))

seg1 is an item viewed from the Segment relation. Calling item.next on this will return
the next item in the Segment relation. A Segment item may also be in the SylStructure
relation. If we traverse it using next in that relation we will hit the end when we come
to the end of the segments in that syllable.

You may view a given item from a specified relation by requesting a view from that. In
Scheme nil will be returned if the item is not in the relation. The function
item.relation takes an item and relation name and returns the item as viewed from that
relation.
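The nil return can be seen with silence segments, which by the convention given earlier are not part of any syllable. This is a minimal sketch assuming a standard English voice; the variable names are illustrative.

```scheme
(set! utt1 (utt.synth (Utterance Text "A short example.")))

;; The first two segments, viewed through the Segment relation.
(set! seg1 (utt.relation.first utt1 'Segment))
(set! seg2 (item.next seg1))

;; An initial silence is not in any syllable, so asking for its view in
;; SylStructure is expected to give nil.
(item.relation seg1 'SylStructure)

;; A non-silence segment can be viewed through SylStructure, where
;; item.parent gives the syllable item that contains it.
(set! seg2_syl (item.relation seg2 'SylStructure))
(if seg2_syl
    (item.name (item.parent seg2_syl)))
```

Testing the result of item.relation before using it, as in the final expression, avoids calling access functions on nil.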

Here is a short example to help illustrate the basic structure. 



(set! utt1 (utt.synth (Utterance Text "A short example.")))

The first segment in utt1 will be silence.

(set! seg1 (utt.relation.first utt1 'Segment))

This item will be silence as can be shown by

(item.name seg1)

If we find the next item we will get the schwa representing the indefinite article.

(set! seg2 (item.next seg1))
(item.name seg2)

Let us move on to the "sh" to illustrate the difference between traversing the Segment
relation as opposed to the SylStructure relation

(set! seg3 (item.next seg2))

Let us define a function which will take an item, print its name, call next on it in
the same relation and continue until it reaches the end.

(define (toend item)
  (if item
      (begin
        (print (item.name item))
        (toend (item.next item)))))

If we call this function on seg3, which is in the Segment relation, we will get a list
of all segments until the end of the utterance

festival> (toend seg3)
"sh"
"oo"
"t"
"i"
"g"
"z"
"aa"
"m"
"p"
"@"
"l"
"#"
nil
festival>

However if we first change the view of seg3 to the SylStructure relation we will be
traversing the leaf nodes of the syllable structure tree, which will terminate at the
end of that syllable.

festival> (toend (item.relation seg3 'SylStructure))
"sh"
"oo"
"t"
nil
festival>

Note that item.next returns the item immediately next in that relation. Thus it returns
nil when the end of a sub-tree is found. item.next is most often used for traversing
simple lists, though it is defined for any of the structures supported by relations. The
function item.next_item allows traversal of any relation, returning a next item until it
has visited them all. In the simple list case this is equivalent to item.next, but in
the tree case it will traverse the tree in pre-order, that is, it will visit roots
before their daughters, and before their next siblings.
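The difference between list and tree traversal can be seen by printing every item that item.next_item visits. In this sketch print_relation is a hypothetical helper written for the illustration; the other functions are those described above.

```scheme
;; print_relation is a hypothetical helper: it visits every item in the
;; named relation with item.next_item and prints each item's name.
(define (print_relation utt relname)
  (let ((i (utt.relation.first utt relname)))
    (while i
      (print (item.name i))
      (set! i (item.next_item i)))))

(set! utt1 (utt.synth (Utterance Text "A short example.")))

;; A simple list: the same order that item.next would give.
(print_relation utt1 'Segment)

;; A list of trees: each word is visited before its syllables, and each
;; syllable before its segments.
(print_relation utt1 'SylStructure)
```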

Scheme is particularly adept at using functions as first class objects. A typical
traversal idiom is to apply some function to each item in a relation. For example
suppose we have a function PredictDuration which takes a single item and assigns a
duration. We can apply this to each item in the Segment relation

(mapcar
 PredictDuration
 (utt.relation.items utt1 'Segment))

The function utt.relation.items returns all items in the relation as a simple lisp
list.

Another method to traverse the items in a relation is to use the while looping paradigm
which many people are more familiar with.

(let ((f (utt.relation.first utt1 'Segment)))
  (while f
    (PredictDuration f)
    (set! f (item.next_item f))))

If you wish to traverse only the leaves of a tree you may call utt.relation.leafs
instead of utt.relation.items. A leaf is defined to be an item with no daughters. In the
while case there isn't a standardly defined item.next_leaf, but one can easily be
defined as

(define (item.next_leaf i)
  (let ((n (item.next_item i)))
    (cond
     ((null n) nil)
     ((item.daughters n) (item.next_leaf n))
     (t n))))



Features as pathnames

Rather than explicitly calling a set of functions to find your way round an utterance we
also allow access through a linear flat pathname mechanism. This mechanism is read-only
but can succinctly access not just features on a given item but features on related
items too.

For example rather than calling an explicit next function to find the name of the
following item thus

(item.name (item.next i))

You can access it via the pathname

(item.feat i "n.name")

Festival will interpret the feature name as a pathname. In addition to traversing the
current relation you can switch between relations via the element R:relationname. Thus
to find the stress value of a segment item seg we need to switch to the SylStructure
relation, find its parent and check the stress feature value.

(item.feat seg "R:SylStructure.parent.stress")




Feature pathnames make the definition of various prediction models much easier. CART
trees, for example, simply specify a pathname as a feature, and dumping features for
training is also a simple task. Full function access is still useful when manipulation
of the data is required, but as most access is simply to find values, pathnames are the
most efficient way to access information in an utterance.
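Several pathnames can be combined to describe the context of an item without writing any traversal code. The following sketch assumes a standard English voice; describe_segment is a hypothetical helper, and the pathname reaching the word through two parent steps is an assumption that follows from the SylStructure tree described earlier (segments under syllables under words).

```scheme
;; describe_segment is a hypothetical helper built only from item.feat.
(define (describe_segment seg)
  (list
   (item.feat seg "name")                                ; this segment
   (item.feat seg "p.name")                              ; previous segment
   (item.feat seg "n.name")                              ; next segment
   (item.feat seg "R:SylStructure.parent.stress")        ; stress of its syllable
   (item.feat seg "R:SylStructure.parent.parent.name"))) ; word it belongs to

(set! utt1 (utt.synth (Utterance Text "A short example.")))
(mapcar describe_segment (utt.relation.items utt1 'Segment))
```

Where a path does not lead to an item, for instance the previous segment of the first segment, a default value is returned rather than an error being raised.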

Access idioms

For example suppose you wish to traverse each segment in an utterance and replace all
vowels in unstressed syllables with a schwa (a rather over-aggressive reduction
strategy).

(define (reduce_vowels utt)
  (mapcar
   (lambda (segment)
     ;; a vowel (ph_vc is "+") whose parent syllable is unstressed (stress is "0")
     (if (and (string-equal "+" (item.feat segment "ph_vc"))
              (string-equal
               "0" (item.feat segment "R:SylStructure.parent.stress")))
         (item.set_name segment "@")))
   (utt.relation.items utt 'Segment)))

Utterance building

As well as using Utterance structures in the actual runtime process of converting
text-to-speech we also use them in database representation. Basically we wish to build
utterance structures for each utterance in a speech database. Once they are in that
structure, as if they had been (correctly) synthesized, we can use these structures for
training various models. For example given the actual durations for the segments in a
speech database and utterance structures for these we can dump the actual durations and
features (phonetic, prosodic context etc.) which we feel influence the durations and
train models on that data.

Obviously real speech isn't as clean as synthesized speech so it's not always easy to
build (reasonably) accurate utterances for the real utterances. However here we will
itemize a number of functions that will make the building of utterances from real speech
easier. Building utterance structures is probably worth the effort considering how easy
it is to build various models from them. Thus we recommend this even though at first the
work may not immediately seem worthwhile.

In order to build an utterance of the type used for our English voices (and which is 
suitable for most of the other languages we have done), you will need label files for 
the following relations. Below we will discuss how to get these labels, automatically, 
by hand or derived from other label files in this list and the relative merits of such 
derivations. 






The basic label types required are

Segment

Segment labels with (near) correct boundaries, in the phone set of your language.

Syllable

Syllables, with stress marking (if appropriate) whose boundaries are closely
aligned with the segment boundaries.

Word

Words with boundaries closely aligned to the syllables and segments. By words
we mean the things which can be looked up in a lexicon; thus "1986" would not
be considered a word and should be rendered as three words "nineteen eighty
six".

IntEvent

Intonation labels aligned to a syllable (either within the syllable boundary or
explicitly naming the syllable they should align to). If using ToBI (or some
derivative) these would be standard ToBI labels, while in something like Tilt these
would be "a" and "b" marking accents and boundaries.

Phrase

A name and marking for the end of each prosodic phrase.

Target

The mean F0 value in Hertz at the mid-point of each segment in the utterance.

Segment labels are probably the hardest to generate. Knowing what phones are there can
only really be done by actually listening to the examples and labeling them. Any
automatic method will have to make low level phonetic classifications, which machines
are not particularly good at (nor are humans for that matter). Some discussion of
autoaligning phones is given in the diphone chapter, where an aligner distributed with
this document is described. This may help, but as much depends on segmental accuracy,
getting it right ultimately requires at least some hand correction. We have used that
aligner on a speech database, though we already knew from another (not so accurate)
aligner what the phone sequences probably were. Our aligner improved the quality of the
existing labels and the synthesizer (phonebox) that used it, but there were external
conditions that made this a reasonable thing to do.

Word labeling can most easily be done by hand; it is much easier to do than segment
labeling. In the continuing process of trying to build automatic labelers for databases
we currently reckon that word labeling could be the last to be done automatically,
basically because with word labeling in place, segment, syllable and intonation labeling
becomes a much more constrained task. However it is important that word labels properly
align with segment labels even when spectrally there may not be any real boundary
between words in continuous speech.

Syllable labeling can probably best be done automatically given segment (and word)
labeling. The actual algorithm for syllabification may change but whatever is chosen
(or defined from a lexicon) it is important that that syllabification is consistently
used throughout the rest of the system (e.g. in duration modeling). Note that automatic
techniques in aligning lexical specifications of syllabification are in their nature
inexact. There are multiple acceptable ways to say words and it is relatively important
to ensure that the labeling reflects what is actually there. That is, simply looking up
a word in a lexicon and aligning those phones to the signal is not necessarily correct.
Ultimately this is what we would like to do but so far we have discovered our unit
selection algorithms are nowhere near robust enough to do this.



The Target labeling required here is a single average F0 value for each segment. This
currently is done fully automatically from the signal. This is naive and a better
representation of F0 could be more appropriate; it is used only in some of the model
building described below. Ultimately it would be good if the F0 need not be explicitly
used at all, with only the factors that determine the F0 value being used, but this is
still a research topic.

Phrases could potentially be determined by a combination of F0, power and silence
detection but the relationship is not obvious. In general we hand label phrases as part
of the intonation labeling process. Realistically only two levels of phrasing can
reliably be labeled, even though there are probably more. That is, roughly, sentence
internal and sentence final, what ToBI would label as (2 or 3) and 4. More exact
labelings would be useful.

For intonation events we have more recently been using Tilt accent labeling. This is
simpler than ToBI and we feel more reliable. The hand labeling part marks a (for accent)
and b (for boundary). We have also split boundaries into rb (rising boundary) and fb
(falling boundary). We have been experimenting with autolabeling these and have had some
success but that's still a research issue. Because there is a well defined and fully
automatic method of going from a/b labeled waveforms to a parameterization of the F0
contour, we've found Tilt the most useful intonation labeling. Tilt is described in
[taylor00a].

ToBI accent/tone labeling [silverman92] is useful too but time consuming to label. If it
exists for the database then it is usually worth using.

In the standard Festival distribution there is a festival script
festival/examples/make_utts which will build utterance structures from the labels for
the six basic relations.

This function can most easily be used given the following directory/file structure in
the database directory. festival/relations/ should contain a directory for each set of
labels named for the utterance relation it is to be part of (e.g. Segment/, Word/, etc.).

The constructed utterances will be saved in festival/utts/.
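Once utterances have been built they can be loaded back into Festival and checked with the access functions described earlier. The following is a sketch under stated assumptions: the file name is hypothetical, and the "end" feature is assumed to be the name under which segment end times from the label files are stored.

```scheme
;; The file name below is hypothetical; use one of the files that
;; make_utts wrote to festival/utts/.
(set! utt1 (utt.load nil "festival/utts/example_001.utt"))

;; Check which relations were built from the label files.
(utt.relationnames utt1)

;; List each segment name with its end time, assuming the Segment items
;; carry an "end" feature taken from the label file.
(mapcar
 (lambda (seg) (list (item.name seg) (item.feat seg "end")))
 (utt.relation.items utt1 'Segment))
```

A quick check of this kind can reveal misaligned or missing labels before any models are trained on the database.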

Extracting features from utterances

Many of the training techniques that are described in the following chapters extract
basic features (via pathnames) from a set of utterances. This can most easily be done by
the festival/examples/dumpfeats Festival script. It takes a list of feature/pathnames,
as a list or from a file, and saves the values for a given set of items in a single
feature file (or one for each utterance). Call festival/examples/dumpfeats with the
argument -h for more details.

For example suppose for all utterances we want the segment duration, its name, the name
of the segment preceding it and the segment following it.

dumpfeats -feats '(segment_duration name p.name n.name)' \
     -relation Segment -output dur.feats festival/utts/*.utt

If you wish to save the features in separate files, one for each utterance, then if the
output filename contains a "%s" it will be filled in with the utterance fileid. Thus to
dump all features named in the file duration.featnames we would call

dumpfeats -feats duration.featnames -relation Segment \
     -output feats/%s.dur festival/utts/*.utt

The file duration.featnames should contain the features/pathnames one per line (without
the opening and closing parentheses).






Other features and other specific code (e.g. selecting a voice that uses an appropriate 
phone set), can be included in this process by naming a scheme file with the -eval 
option. 

The dumped feature files consist of a line for each item in the named relation con- 
taining the requested feature values white space separated. For example 

0.399028 pau 0 sh
0.08243 sh pau iy
0.07458 iy sh hh
0.048084 hh iy ae
0.062803 ae hh d
...
0.08208 ax y r
0.036936 r ax d
0.036935 d r aa
0.081057 aa d r
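The same values can be looked at interactively for a single utterance, which is a convenient way to check a list of pathnames before running dumpfeats over a whole database. The sketch below is not how dumpfeats itself is implemented; segment_feats is a hypothetical helper that simply calls item.feat for each requested pathname.

```scheme
;; segment_feats is a hypothetical helper: for every item in the Segment
;; relation it returns the list of values for the given pathnames.
(define (segment_feats utt featnames)
  (mapcar
   (lambda (seg)
     (mapcar (lambda (f) (item.feat seg f)) featnames))
   (utt.relation.items utt 'Segment)))

(set! utt1 (utt.synth (Utterance Text "A short example.")))

;; The same four features as in the dumpfeats call above.
(segment_feats utt1 '("segment_duration" "name" "p.name" "n.name"))
```

Each inner list corresponds to one line of a dumped feature file, with the values in the order the pathnames were given.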



Notes

1. http://www.cstr.ed.ac.uk/projects/festival/
2. http://www.cstr.ed.ac.uk
3. http://www.ed.ac.uk/









Chapter 4. Basic Requirements 



This section identifies the basic requirements for building a voice in a new language, 
and adding a new voice in a language already supported by Festival. 



Hardware/software requirements 

Because we are most familiar with a Unix environment, the scripts, tools etc. assume
such a basic environment. This is not to say you couldn't run these scripts on other
platforms, as many of these tools are supported on platforms like WIN32; it's just that
in our original work environment, Unix is ubiquitous and we like working in it. Festival
also runs on Win32 platforms.



Much of the testing was done under Linux; wherever possible, we are using freely
available tools. We are happy to say that no non-free tools are required to build voices,
and we have included citations and/or links to everything needed in this document.

We assume Festival and the Edinburgh Speech Tools 1.2.3.

Note that we make extensive use of the Speech Tools programs, and you will need the
full distribution of them as well as Festival, rather than the run-time (binary) only
versions which are available for some Linux platforms. If you find the task of compiling
Festival and the Speech Tools daunting, you will probably find the rest of the tasks
specified in this document more so. However, it is not necessary for you to have any
knowledge of C++ to make voices, though familiarity with text processing techniques
(e.g. awk, sed, perl) will make understanding the examples given much easier.

We also assume a basic knowledge of Festival, and of speech processing in general. We
expect the reader to be familiar with basic terms such as F0, phoneme, and cepstrum,
but not in any real detail. References to general texts are given (when we know them to
exist). A basic knowledge of programming in Scheme (and/or Lisp) will also make things
easier. A basic capability in programming in general will make defining rules, etc.,
much easier.



If you are going to record your own databases you will need recording equipment: the
higher quality, the better. A proper recording studio is ideal, though it may not be
available to everyone. A cheap microphone stuck on the back of a standard PC is not
ideal, though we know most of you will end up doing that. A high quality sound board, a
close-talking, high quality microphone and a nearly soundproof recording environment
will often be the compromise between these two extremes.

Many of the techniques described in here require a fair amount of processing time to
achieve, though machines are indeed getting faster and this is becoming less of an
issue. If you use the provided aligner for labeling diphones you will need a processor
of reasonable speed, likewise for the various training techniques for intonation,
duration modeling and letter-to-sound rules. Nothing presented here takes weeks, though
a number of processes may be over-night jobs, depending on the speed of your machine and
the size of your database.

Also we think that you will need a little patience. The process of building a voice is
not necessarily going to work first time. It may even fail completely, so if you don't
expect anything special, you won't be disappointed.

Voice in a new language

The following list is a basic check list of the core areas you will need to provide
pieces for. You may, in some cases, get away with very simple solutions (e.g. fixed
phone durations), or be able to borrow from other voices/languages, but whatever you end
up doing, you will need to provide something for each part.

You will need to define

• Phone set
• Token processing rules (numbers etc)
• Prosodic phrasing method
• Word pronunciation (lexicon and/or letter-to-sound rules)
• Intonation (accents and F0 contour)
• Durations
• Waveform synthesizer

Voice in an existing language 

The most common case is when someone wants to make their own voice into a synthesizer.
Note that the issues in voice modeling of a particular speaker are still open research
problems. Much of the quality of a particular voice comes from the waveform generation
method, but other aspects of a speaker such as intonation, duration and pronunciation
are all part of what makes that person's voice sound like them. All of the
general-purpose voices we have heard in Festival sound like the speaker they were
recorded from (at least as far as we know all the speakers), but they also don't have
all the qualities of that person's voice, though they can be quite convincing for
limited-domain synthesizers.

As a practical recommendation, to make a new speaker in an existing supported language
you will need to consider

• Waveform synthesis
• Speaker specific intonation
• Speaker specific duration



Chapter 20 deals specifically with building a new US or UK English voice. This is a
relatively easy place to start, though of course we encourage reading this entire
document.

Another possible solution to getting a new or particular voice is to do voice
conversion, as is done at the Oregon Graduate Institute (OGI) [kain98] and elsewhere.
OGI have already released new voices based on this conversion and may release the
conversion code itself, though the license terms are not the same as those of Festival
or this document.

Another aspect of a new voice in an existing language is a voice in a new dialect. The
requirements are similar to those of creating a voice in a new language. The lexicon and
intonation probably need to change as well as the waveform generation method (a new
diphone database). Although much of the text analysis can probably be borrowed, be aware
that simple things like number pronunciation can often change between dialects (cf. US
and UK English).

We also do work on limited domain synthesis in the same framework. For limited domain
synthesis, a reasonably small corpus is collected, and used to synthesize a much larger
range of utterances in the same basic style. We give an example of recording a talking
clock, which, although built from only 24 recordings, generates over a thousand unique
utterances; these capture a lot of the latent speaker characteristics from the data.






Selecting a speaker 

We have found that choosing the right speaker to record is actually as important as all
the other processes we describe. Some people just have voices that are better for
synthesis than others. In general people with clearer, more consistent voices are better
than others, but unfortunately it's not as clear as that. Professional speakers are in
general better for synthesis than non-professional ones, though not all professional
voices work, and many non-professional speakers give good results.

In general you are looking for clear speakers, who don't mumble and don't have any
speech impediments. It helps if they are aware of speech technology, i.e. have some
vague idea of what a phoneme is. A consistent delivery is important. As different parts
of speech from different parts of the recorded database are going to be extracted and
put together, you want the speech to be as consistent as possible. This is usually the
quality that professional speakers have (or anyone used to public speaking). Also note
that most people can't actually talk for long periods without practice.
Lecturers/teachers are actually much more used to this than students, though this
ability can be learned quite easily.

Note that choosing the right speaker, if it's important to you, can be a big project.
For example, an experiment done at AT&T to select good speakers for synthesis involved
recording fair-sized databases for a number of professional speakers (20 or so),
building simple synthesis examples from their voices, and submitting these to a large
number of human listeners to get them to evaluate quality [syrdal??]. Of course most
of us don't have the resources to do searches like that, but it is worth taking a little
time to think of the best speaker before investing the considerable time it takes to
build a voice.

Note that trying to capture a particular voice is still somewhat difficult. You will
always capture some of that person's voice, but it's unlikely a synthesizer built from
recordings of a person will always sound just like that person. However, you should
note that voices you think are distinctive may be so because of lots of variation. For
example, Homer Simpson's voice is distinctive but it would be difficult to build a
synthesizer from. The Comic Book Guy (also from the Simpsons) also has a very
distinctive voice, but it is much less varied prosodically than Homer's and hence it is
likely to be easier to build a synthesizer from his voice. Likewise, Patrick Stewart's
voice should be easier than Jim Carrey's.

However, as it is usually the case that you just have to take any speaker you have
willing to do it (often yourself), there are still things you should do that will help
the quality. It is best if recording is done in the same session, as it is difficult to
set up the same recording environment again (even when you are very careful). We
recommend recording some time in the morning (not immediately after you get up), and
if you must re-record, do so at the same time of day. Avoid recording when the speaker
has a cold or a hangover, as it can be difficult to recreate that state if multiple
sessions are required.

Who owns a voice 

It is very important that your speaker and you understand the legal status of the
recorded database. It is very wise to have the speaker sign a statement before you
start recording, or at least to talk to them to ensure they understand what you want to
do with the data and what restrictions, if any, they require. Remember that in
recording their voice you are potentially allowing anyone (who gets access to the
database) to fake that person's voice. The whole issue of building a synthetic voice
from recordings is still actually an uninvestigated part of copyright, but there are
clear ways to ensure you won't be caught out by a lawsuit, or a disgruntled subject,
later.

Explain what you are going to do with the database. Get the speaker to agree to the
level of use you may make of the recordings (and any use of them). This will roughly be:






• free for any use 

• free to distribute to anyone but cannot be used for commercial purposes without 
further contract. 

• research use only (does this allow public demos?) 

• fully proprietary 

You must find out what the speaker agrees to before you start spending your time 
recording. There is nothing worse than spending weeks on building a good voice 
only to discover that you don't have rights to do anything with it. 



Also, don't lie to the speaker; make it clear what it means if their voice is to be
released for free. If you release the voice on the net (as we do with our voices),
anyone may use it. It could be used anywhere, from reading porn stories to emergency
broadcast systems. Also note that effectively building a synthesizer from a voice means
that the person will no longer be able to use voice ID systems as password protection
(actually that depends on the type of voice ID system). However, also reassure them
that these extremes are very unlikely and that they will actually be contributing to
the world of speech science, and people will use their voice because they like it.

We (KAL and AWB) have already given up the idea that our voices are in any way ours
and have recorded databases and made them public (even though AWB has a funny accent).
When recording others we ensure they understand the consequences and get them to
explicitly sign a license that gives us (and/or our institution) the rights to do
anything we wish, but the intention is that the voice will be released for free without
restriction. From our point of view, having no restrictions is by far the easiest. We
also give (non-exclusive) commercial rights to the voice to the speaker themselves.
This actually costs us nothing, and given that most of our recorded voices are free,
the speaker could re-release the free version and use it commercially (as can anyone
else), but it's nice that the original license allows the speaker direct commercial
rights (none that I know of have actually done anything with those rights).

There may be other factors though. Someone else may be paying for the database, so
they need to be accommodated in any such license. Also, a database may already be
recorded under some license and you wish to use it to build a synthetic voice; make
sure you have the rights to do this. It's amazing how many people record speech
databases and don't take into account the fact that someone else may build a general
TTS system from their voice. It's better that you check than have to deal with
problems later.



An example of the license we use at CMU is given in the festvox distribution 

festvox/src/vox_files/speaker.licence

Also note that there are legal aspects to other parts of a synthetic voice that the
builder must also ensure they have rights to. Lexicons may have various restrictions.
The Oxford Advanced Learners' Dictionary that we currently use for UK English voices
is free for non-commercial use only, thus effectively imposing the same restriction on
the complete voice even though the prosodic models and diphone databases are free.
Also be careful to check the rights when building models from existing data. Some
databases are free for research only, and even data derived from them (e.g. duration
models) may not be further distributed. Check at the start; question all pieces of the
system to make sure you know who owns what and what restrictions they impose.
This process is worth doing at the start of a project so things are always clear.

Recording under Unix

Although the best recording conditions probably can't be achieved by recording directly
to a computer (under Unix or some other operating system), we accept that in many
cases recording directly to a computer has many conveniences that outweigh its
disadvantages.






The disadvantages are primarily in quality: the electromagnetic noise generated by a
machine is large, and most computers have particularly poor shielding of their audio
hardware, such that noise always gets added to the recorded signal. But there are
ways to try to minimize this. Getting better quality sound cards helps, but they can be
very expensive. "Professional" sound cards can go for as much as a thousand dollars.

The advantage of using a computer directly is that you have much more control over
the recording session, and first-line processing can be done at record time. In recent
years we found the task of transferring the recorded data from DAT tapes to a computer,
and splitting it into individual files, even before phonetic labeling, a significantly
laborious task, often larger and more resource intensive than the rest of the voice
building process. So recently we've accepted that direct recording to disk using a
machine is worthwhile, except for voices that require the highest quality (and when we
have the money to take more time). This section describes the issues in recording
under Unix, though they mostly apply under Windows too if you go that route.

The first thing you'll find out about recording on a computer is that no one knows
[...] when it works it's often still flaky.

First you want to ensure audio works on the machine at all. Find out if anyone
has actually heard any audio coming from it. Even though there may be an audio
board in there, it may not have any drivers installed, or the kernel may not know about
it. In general, audio rarely "just works" under Linux, in spite of people claiming Linux
is ready for the desktop. But before you start claiming Windows is better, we've found
that audio rarely "just works" there either. Under Windows, when it works it's often
fine, but when it doesn't the general Windows user is much less likely to have any
knowledge about how to fix it, while at least in the Linux world users have more
experience in getting recalcitrant devices to come to life.

It's difficult to name products here as turnover in PC hardware is frantic. Generally
newer audio cards won't work and older cards do. For audio recording, we only
require 16-bit PCM, and none of the fancy FM synthesizers and wavetable devices;
those are irrelevant and often make cards difficult or very hard to use. Laptops are
particularly good for recording, as they generally add less noise to the signal
(especially if run on the battery) and they are portable enough to take into a quiet
place that doesn't have a desktop cooling fan running in the background. However, sound
on laptops under Unix (Linux, FreeBSD, Solaris etc.) is unfortunately even less likely
to work, due to leading-edge technology and proprietary audio chips. Linux is
improving in this area, but although we are becoming relatively good at getting audio
to work on new machines, it's still quite a skill.

In general, search the net for answers here. Linux offers both ALSA and the Open
Sound drivers, which go a long way to help. Note, though, that even when these
work there may be other problems (e.g. on one laptop you can have either sound
working or suspend working but not both at once).

To test audio you'll need something to modify the basic gain on the audio drivers
(e.g. xmixer under Linux/FreeBSD, or gaintool under Solaris). You can test
audio out with the command (assuming you've set ESTDIR)

$ESTDIR/bin/na_play $ESTDIR/lib/example_data/kdt_001.wav

which should play a US male voice saying "She had your dark suit in greasy
washwater all year."

To test audio in you can use the command

$ESTDIR/bin/na_record -o file.wav -time 3







where the time given is the number of seconds to record. Note you may need to
change the microphone levels and/or input gain to make this work.

You should look at the audio signal as well as listen to it. It's often much easier to
see noise in a signal than to hear it. The human ear has developed so that it can mask
out noise, but unfortunately it has not developed enough to mask all noise in synthesis:
when we concatenate different parts of the signal, the human ear isn't as forgiving.

The following is a recording made with background noise (probably a computer).

[Figure: Example waveform recorded with lots of background noise]

The same piece of speech (re-iterated) in a quiet environment looks like

[Figure: Example waveform recorded in clean environment]

As you can see, the quiet parts of the speech are much quieter in the clean case than
in the noisy case.
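
A rough numerical check can back up what you see in the plot. The sketch below is our
own illustration in Python, not part of the Speech Tools; the function names, the
160-sample frame and the 10% fraction are all assumptions. It estimates a noise floor
as the mean RMS of the quietest frames, given samples already read into a list of
integers (for example with Python's standard wave module).

```python
import math


def frame_rms(samples, frame_len):
    """RMS amplitude of each consecutive, non-overlapping frame."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def noise_floor(samples, frame_len=160, quiet_fraction=0.1):
    """Mean RMS of the quietest frames: a rough noise-floor estimate.

    frame_len and quiet_fraction are illustrative defaults, not values
    taken from any existing tool.
    """
    rms = sorted(frame_rms(samples, frame_len))
    if not rms:
        raise ValueError("recording shorter than one frame")
    n = max(1, int(len(rms) * quiet_fraction))
    return sum(rms[:n]) / n


if __name__ == "__main__":
    # Synthetic stand-ins: one silent frame versus one frame of hum,
    # each followed by a loud "speech" frame.
    clean = [0] * 160 + [1000, -1000] * 80
    noisy = [50, -50] * 80 + [1000, -1000] * 80
    print("clean:", noise_floor(clean))
    print("noisy:", noise_floor(noisy))
```

Comparing this value for a noisy and a clean recording of the same prompt gives a
single number to track as you change the recording set-up.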

Under Linux there is an Audio-Quality-HOWTO document that helps get audio up and
running. At time of writing it can be found at http://audio.netpedia.net/aqht.html






Extracting pitchmarks from waveforms 

Although never as good as extracting pitchmarks from an EGG signal, we have had a
fair amount of success in extracting pitchmarks from the raw waveform. This is still
somewhat a research area, but in this section we'll give some general pointers about
how to get pitchmarks from waveforms, or if not, at least how to tell whether you are
getting reasonable pitchmarks from waveforms.

The basic program which we use for the extraction is pitchmark, which is part of the
Speech Tools distribution. We include the script bin/make_pm_wave (which is copied
by the ldom and diphone setup processes). The key line in the script is

$ESTDIR/bin/pitchmark tmp$$.wav -o pm/$fname.pm -otype est \
-min 0.005 -max 0.012 -fill -def 0.01 -wave_end \
-lx_lf 200 -lx_lo 51 -lx_hf 80 -lx_ho 51 -med_o 0

This program filters the incoming waveform (with a low and a high band filter), then
uses autocorrelation to find the pitch mark peaks with the min and max specified.
Finally it fills in the unvoiced sections with the default pitchmarks.

For debugging purposes you should remove the -fill option so you can see where
it is finding pitchmarks. Next you should modify the min and max values to fit the
range of your speaker. The defaults here (0.005 and 0.012) are for a male speaker in
about the range 200 to 80 Hz. For a female you probably want values about 0.0033
and 0.007 (300 Hz to 140 Hz).
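
The min and max values are pitch periods in seconds, so they are simply the reciprocals
of the highest and lowest F0 you expect from the speaker. The following Python sketch
makes the arithmetic explicit; the helper name is hypothetical and is not part of the
Speech Tools.

```python
def pitch_period_range(f0_low_hz, f0_high_hz):
    """Return (min_period, max_period) in seconds for an F0 range in Hz.

    Hypothetical helper: the shortest period comes from the highest F0
    and the longest period from the lowest F0.
    """
    if not 0 < f0_low_hz < f0_high_hz:
        raise ValueError("expected 0 < f0_low_hz < f0_high_hz")
    return 1.0 / f0_high_hz, 1.0 / f0_low_hz


if __name__ == "__main__":
    # Assumed speaker ranges, taken from the figures quoted in the text.
    for label, low, high in [("male", 80, 200), ("female", 140, 300)]:
        min_p, max_p = pitch_period_range(low, high)
        print(f"{label}: min {min_p:.4f} max {max_p:.4f}")
```

For an 80 to 200 Hz speaker this gives 0.005 and 0.0125, close to the 0.005 and 0.012
defaults quoted above.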

Modify the script to your approximate needs, and run it on a single file, then run the
script that translates the pitchmark file into a labeled file suitable for emulabel

bin/make_pm_wave wav/awb_0001.wav
bin/make_pm_pmlab pm/awb_0001.pm

You can then display the pitchmarks with

emulabel etc/emu_pm awb_0001

This should show a number of pitchmarks over the voiced sections of speech. If
there are none, or very few, it definitely means the parameters are wrong. For example,
the above parameters on this file taataataa properly find pitchmarks in the three
vowel sections




[Figure: Pitchmarks in waveform signal]

If the high and low pass filter values -lx_lf 200 -lx_hf 80 are inappropriate for
the speaker's pitch range you may get either too many or too few pitch marks. For
example, if we change the 200 to 60, we find only two pitch marks in the third vowel.



[Figure: Bad pitchmarks in waveform signal]

If we zoom in on our first example we get the following

[Figure: Close-up of pitchmarks in waveform signal]

The pitch marks should be aligned to the largest (above zero) peak in each pitch
period. Here we can see there are too many pitchmarks (effectively twice as many).
The pitchmarks at 0.617, 0.628, 0.639 and 0.650 are extraneous. This means our pitch
range is too wide. We can rerun, changing the min size and the low frequency filter:

$ESTDIR/bin/pitchmark tmp$$.wav -o pm/$fname.pm -otype est \
-min 0.007 -max 0.012 -fill -def 0.01 -wave_end \
-lx_lf 150 -lx_lo 51 -lx_hf 80 -lx_ho 51 -med_o 0



We get the following

[Figure: Close-up of pitchmarks in waveform signal (2)]



This is better, but it's now missing pitchmarks towards the end of the vowel, at 0.634,
0.644 and 0.656. Giving more range for the min (0.005) gives slightly better results,
but still we get bad pitchmarks. The double pitch mark problem can be lessened by not
only changing the range but also the order of the high and low pass filters
(effectively allowing more smoothing). Thus when secondary pitchmarks appear,
increasing the -lx_lo parameter often helps



$ESTDIR/bin/pitchmark tmp$$.wav -o pm/$fname.pm -otype est \ 
-min 0.005 -max 0.012 -fill -def 0.01 -wave_end \ 
-lx_lf 150 -lx_lo 91 -lx_hf 80 -lx_ho 51 -med_o 0



We get the following

[Figure: Close-up of pitchmarks in waveform signal (3)]

This is satisfactory for this file and probably for the whole database of that speaker,
though it is worth checking a few other files to get the best results. Note that by
increasing the order of the filter the pitchmarks creep forward (which is bad).

If you feel brave (or are desperate) you can actually edit the pitchmarks yourself
with emulabel. We have done this occasionally, especially when we find persistent
synthesis errors (spikes etc.). You can convert a pm_lab file back into its pitchmark
format with



bin/make_pm_pmlab pm_lab/*.lab 






A post-processing step is provided that moves the predicted pitchmarks to the nearest
waveform peak. We find this useful for both EGG-extracted pitchmarks and
waveform-extracted ones. A simple script is provided for this

bin/make_pm_fix pm/*.pm
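
To make the idea of moving a pitchmark to the nearest peak concrete, here is a small
Python sketch. It is our own simplified illustration under stated assumptions (a fixed
search window in seconds, samples held in a list), and it is not the implementation
behind make_pm_fix.

```python
def snap_to_peaks(samples, sample_rate, pitchmarks, window=0.002):
    """Move each pitchmark time (seconds) to the largest sample value
    within +/- window seconds of it.

    Simplified illustration only; the window size is an assumption and
    should stay well below the speaker's shortest pitch period.
    """
    half = max(1, int(round(window * sample_rate)))
    snapped = []
    for t in pitchmarks:
        centre = int(round(t * sample_rate))
        lo = max(0, centre - half)
        hi = min(len(samples), centre + half + 1)
        if lo >= hi:
            # The pitchmark lies outside the waveform: leave it alone.
            snapped.append(t)
            continue
        best = max(range(lo, hi), key=lambda i: samples[i])
        snapped.append(best / sample_rate)
    return snapped


if __name__ == "__main__":
    wave_samples = [0] * 100
    wave_samples[40] = 1000   # true peak of the first pitch period
    wave_samples[70] = 900    # true peak of the second pitch period
    print(snap_to_peaks(wave_samples, 1000, [0.042, 0.068], window=0.005))
```

Because the search picks the largest positive sample, an inverted recording would pull
the marks onto the wrong half of each period, which is the situation the next paragraph
deals with.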






If your pitchmarks are aligning to the largest troughs rather than peaks, your signal
is upside down (or you are erroneously using -inv). If you are using -inv, don't; if
you are not, then invert the signal itself with

for i in wav/*.wav
do
   # negate every sample, overwriting the file in place
   ch_wave -scale -1.0 "$i" -o "$i"
done



Note the above are quick heuristic hacks we have used when trying to get pitchmarks
out of wave signals. These require more work to offer a more reliable solution, which
we know exists. Extracting (fixed frame) LPC coefficients and extracting a residual,
then extracting pitchmarks, could give a more reliable solution, but although all these
tools are available, we have not experimented with that yet.












Chapter 5. Limited domain synthesis 



This chapter discusses and gives examples of building synthesis systems for limited
domains. By limited domain, we mean applications where the speech output is
constrained. Such domains may still be infinite, but they are targeted to specific
vocabulary and phrases. In fact, with today's speech systems such limited domain
applications are the most common. Some typical examples are telling the time and
reading telephone numbers. However, from experience we can see that this technique
can be extended to include more general information-giving systems and dialog
systems, such as reading the weather, or even the DARPA Communicator domain (a flight
information dialog system).

Limited domains are discussed here as it is felt that it should be easier to build unit
selection type synthesizers for domains where there is a much smaller and more
controlled number of units. The second reason is that general TTS systems (e.g. diphone
systems) still sound synthetic. General unit selection, when it's good, offers
near-human quality, but when it's bad it is usually much worse than a diphone
synthesizer. Hybrid systems look interesting, but as we cannot yet automatically detect
when general unit selection systems go bad, it's not clear when a diphone system should
be swapped in. But as unit selection offers so much promise, it is hoped that in a
limited domain we can get the good quality of unit selection and avoid the bad quality.
Finally, although full TTS systems may be our ultimate goal, for many existing systems
a limited domain synthesizer is actually adequate.

There is a stage beyond limited domain, but falling short of general synthesis,
where the most common phrases are the best synthesized and the quality gracefully
degrades as the phrases become less common. Some hybrid recorded prompts/unit
selection/diphone systems have been proposed and should be able to deliver an
answer, but we will not deal directly with those here.

However, one point you quickly find is that although most speech dialog systems are
very constrained in their vocabulary, many require the hardest class of words: proper
names.

Continuing in tutorial mode, this chapter first gives a complete walkthrough of a
talking clock. This is a small example which will probably work. Following through
this example will give you a good idea of what is involved in building a limited
domain synthesizer. Also, in the following sections problems and modifications can be
better discussed with respect to this complete example.

designing the prompts

To get a good limited domain synthesizer, it is important to understand what is going
on so you can properly tailor your application to take proper advantage of what can
be good, and to avoid the limitations these methods impose.

Note that you may wish to change your application to take better advantage of this
by making its output forms more regular, or at least to use a smaller vocabulary. As
a first basic approximation, the techniques here require that the training data contain
all words which are to be actually synthesized. Therefore if a word does not appear in
the training data it cannot be synthesized. We do provide fall-back positions, using
a diphone voice, but that will always be worse than the more natural unit selection
synthesis. Often this isn't much of a restriction, or you can tailor your application
to avoid having a large vocabulary. For example, if you are going to build a system
for reading weather reports you can make the weather reports not actually name the
city/town they refer to and just use phrases like "This city ..." and depend on context
for the user to know which city is actually being talked about.

Of course many speech applications have limited vocabularies in all but a very few
places. Proper names such as places, people, movie names, etc. are in general
completely open classes. Building a speech application around those aspects isn't easy
and may make a limited domain synthesizer just impractical. But it should be noted that
those open classes are also the classes that more general synthesizers will often fail
on too. Some hybrid system may better solve that, which we will not really deal with
here.

For almost-closed classes, recording and modifying the data may be a solution. We
have not yet got enough experience to comment on this, but we feel that it may be a
reasonable compromise.

The most difficult part of building a limited domain synthesizer is designing the
database to record that best covers what you wish the synthesizer to say. Sometimes
this is fairly easy, in that you wish the synthesizer to simply read utterances in a
very standard form where slots will be filled with varying values (such as dates,
numbers etc.), such as

The area code you require is NUMBER NUMBER NUMBER. 

The prompts can be devised to fill in values for each of the NUMBER variables.
More complex utterances can still be viewed in this way

The weather at TIME on DATE: outlook OUTLOOK, NUMBER degrees.

But once we move into more general dialog it initially appears harder to properly
find the utterances that cover the domain.
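
Slot-filling forms like the two above can be expanded mechanically into candidate
prompts. The following Python sketch is only an illustration: the function name, the
{time}-style field syntax (Python's str.format) and the example slot values are our
assumptions, not part of festvox.

```python
from itertools import product


def expand_template(template, slots):
    """Yield every prompt obtained by filling each slot with each value.

    `template` uses Python str.format fields; `slots` maps field names
    to the list of values that field may take.
    """
    names = sorted(slots)
    for values in product(*(slots[name] for name in names)):
        yield template.format(**dict(zip(names, values)))


if __name__ == "__main__":
    prompts = list(expand_template(
        "The weather at {time} on {date}: outlook {outlook}, {number} degrees.",
        {
            # Illustrative values only.
            "time": ["nine a m", "noon"],
            "date": ["Monday", "Tuesday"],
            "outlook": ["sunny", "cloudy"],
            "number": ["ten", "twenty"],
        },
    ))
    print(len(prompts))  # 2 * 2 * 2 * 2 combinations
    for prompt in prompts[:3]:
        print(prompt)
```

The full cross product grows very quickly, so in practice it serves as a pool of
candidates from which a much smaller set of prompts is chosen for recording.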

The first important observation to make is that in such systems where limited domain
synthesis is practical the phrases to be spoken are almost certainly generated
by a computer. That is, there exists an explicit function which generates the language
used by the application. In some cases this will take the form of an explicit grammar.
In this case we can use that grammar to generate phrases in the language and then
select from them utterances that adequately cover the domain. However, even when there
is an explicit grammar it usually will not explicitly encode the frequency of each
generated utterance. As we wish to ensure that the most common phrases are synthesized
best, we ideally need to know which utterances are to be synthesized most often to
properly select which utterances to record.

Where a system is already running with a standard synthesizer it is possible to record
what is currently being said and how often. We can then use such logs of current
system usage to select which utterances should be in our set of prompts to record.

In practice you will be designing the system's output at the same time as the limited
domain synthesizer, so some combination, and guessing of the frequency and coverage,
will be necessary.

In general you should design your databases to have at least 2 (and probably 5)
examples of each word in your vocabulary. Secondly, you should select utterances that
maximise bi-gram coverage. That is, try to ensure as many different word-word pairings
as possible over your corpus. We have used techniques based on these recommendations to
greedily select utterances from larger corpora to record.
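
As an illustration of what greedy selection along these lines can look like, here is a
minimal Python sketch. It is not the selection code we used: the function names, the
whitespace tokenization and the tie-breaking rule are all assumptions made for the
example.

```python
from collections import Counter


def bigrams(utterance):
    """Set of adjacent word pairs in an utterance (lower-cased)."""
    words = utterance.lower().split()
    return set(zip(words, words[1:]))


def greedy_select(candidates, min_word_count=2, max_prompts=None):
    """Repeatedly pick the candidate that adds the most unseen word
    pairs, breaking ties by how many of its words still have fewer than
    min_word_count examples. Stops when nothing adds coverage."""
    remaining = list(candidates)
    selected, seen_pairs, word_counts = [], set(), Counter()

    def gain(utt):
        new_pairs = len(bigrams(utt) - seen_pairs)
        needy = sum(1 for w in set(utt.lower().split())
                    if word_counts[w] < min_word_count)
        return (new_pairs, needy)

    while remaining and (max_prompts is None or len(selected) < max_prompts):
        best = max(remaining, key=gain)
        if gain(best) == (0, 0):
            break  # no remaining candidate improves coverage
        remaining.remove(best)
        selected.append(best)
        seen_pairs |= bigrams(best)
        word_counts.update(best.lower().split())
    return selected


if __name__ == "__main__":
    pool = ["it is one", "it is two", "it is one", "is it"]
    for prompt in greedy_select(pool, min_word_count=1):
        print(prompt)
```

Where usage logs are available, the same loop could weight each candidate by how often
it is actually spoken; that extension is left out here to keep the sketch short.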




customizing the synthesizer front end

Once you have decided on a set of utterances that appropriately cover the domain you
also need to consider how those particular text strings are synthesized. For example,
if the data contains flight numbers, dates, times etc., you must ensure that festival
properly renders those. As we are discussing a limited domain, the distribution of
token types will be different from standard text but also more constrained, so simple
changes to the lexicon, token to word rules, etc. will allow proper synthesis of these
utterances.

One particular area of customization we have noted is worthwhile is that of phrasing.
It seems important to explicitly mark phrasing in the prompts, and have the speaker
follow such phrasing, as it allows for much better joins in unit selection, as well as
consistent prosody from the speaker. Thus in the default code provided below the normal
phrasing module in festival is replaced with one that treats punctuation as phrasal
markers.



autolabeling issues 



We currently autolabel the recorded prompts using the alignment technique
based on [malfrere97] that we discussed above for diphone labeling. Although
this technique works, it is not as robust for general speech as it is for carefully
articulated nonsense words. Specifically, this technique does not allow for alternative
pronunciations, which are common.

Ideally we should use a more general speech recognition system. In fact, for labeling
of unit selection databases the best results would come from training a recognizer
using Baum-Welch on the database until convergence. This would give the most
consistent labeling. We have started initial experiments with using the open source CMU
Sphinx recognition system and are likely to provide scripts to do this in later
releases.

In the meantime, the greatest problem is that the predicted phone list must be the
same as (or very similar to) what was actually spoken. This can be achieved by a
knowledgeable speaker, and by customizing the front end of the synthesizer so it
produces more natural segments.

Two observations are worth mentioning here. First, if the synthesizer makes a mistake
in pronunciation and the human speaker does not, we have found that things may
work out anyway. For example, we found the word "hwy" appeared in the prompts,
which the synthesizer pronounced as "H W Y" while the speaker said "highway". The
forced alignment, however, caused the phones in the pronunciation of the letters "H W
Y" to be aligned with the phones in "highway"; thus on selection appropriate, though
misnamed, phones were selected. However, you should not depend on such compound
errors.

The second observation is that festival includes prediction of vowel reduction. We
are beginning to feel that such prediction is unnecessary in limited domain
synthesizers or in unit selection in general. This vowel variation can itself be
achieved by the clustering techniques themselves, and hence allows a more reasonable
back-off selection strategy than making firm type decisions.






unit size and type

The basic cluster unit selection code available in festival uses segments as the size
of unit. However, the acoustic distance measure used in clustering uses significant
portions of the previous segment. Thus the cluster unit selection effectively selects
diphones from the database.

The type of units the cluster selection code uses is based on the segment name, by
default. In the case of limited domain synthesis we have found that constraining this
further gives both better and faster synthesis. Thus we allow for the unit type to be
defined by an arbitrary feature. In the default limited domain set up we use

SEGMENT_WORD 

That is, the segment plus the word the segment comes from. Note this doesn't mean
we are doing word concatenation in our synthesizer. We are still selecting phone units,
but these phones are differentiated depending on the word they come from;
thus a /t/ from the word "unit" cannot be used to synthesize a /t/ in "table". The
primary reason for us doing this was to cut down the search, though it notably improves
synthesis quality too. As we have constructed the database to have good coverage this
is a practical thing to do.






The feature function clunit_name constructs the unit type for a particular segment 
item. We have provided the above default (segment name plus (downcased) word 
name), but it is easy to extend this. 

In one domain we have worked in we wished to differentiate between words in different
prosodic contexts. In particular we wished to mark words as "questionable" so we
can ask users for confirmation. To do this we marked the "questionable" words in the
prompts with a question mark prefix. We then recorded them with appropriate
intonation and then defined our clunit_name function to include "C_" if the word was
prefixed by a question mark. For example, the following two prompts will be read in
a different manner

theater is Squirrel Hill Theater
theater is ?Squirrel ?Hill ?Theater

Likewise in unit selection the units in the word "Squirrel" will not be used to
synthesize the word "?Squirrel" and vice versa. Although crude, this does give simple
control over prosodic variation, though this technique can require the vocabulary of
the units to increase to where this technique ceases to be practical.
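
As a rough illustration of the unit naming just described, the Python sketch below
builds a unit type from a segment name and a word. The real clunit_name is a feature
function inside festival, written in Scheme; the exact string layout used here
(segment, underscore, optional "C_", downcased word) is our assumption for the sake of
the example.

```python
def unit_type(segment_name, word):
    """Return a unit type such as 't_unit' or 't_C_table'.

    Assumed naming scheme for illustration: segment name, then an
    optional 'C_' marker when the word carries a '?' prefix, then the
    downcased word without the prefix.
    """
    marker = ""
    if word.startswith("?"):
        marker = "C_"
        word = word[1:]
    return f"{segment_name}_{marker}{word.lower()}"


if __name__ == "__main__":
    for seg, word in [("t", "unit"), ("t", "table"),
                      ("s", "Squirrel"), ("s", "?Squirrel")]:
        print(seg, word, "->", unit_type(seg, word))
```

Because the plain and the "?"-prefixed forms of a word map to different unit types,
candidates recorded with one intonation are never offered for the other.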



It would be good if this technique had a back-off strategy where, if no unit can be
found for a particular word, it would allow other words to contribute candidates. This
is ultimately what general unit selection is. We do consider this our goal in unit
type, but in the interest of building quick and reliable limited domain synthesizers we
do not yet do this, though we consider it an area we will experiment with. One specific
area that only partially crosses this line is the synthesis of numbers. It seems very
reasonable to allow selection of units from suitable numbers (e.g. "seven" and
"seventy") but we have not experimented with that yet.

One further important point should be highlighted about this method for defining unit
types. Although including the word name in the unit name does greatly encourage
whole words to be selected, it does not mean that joins in the synthesized utterances
only occur at word boundaries. It is common that contiguous units are selected from
different occurrences of the same word. Mid-word joins (e.g. within vowels, or at
stops) at stable places are common. The optimal coupling technique selects the best
place within a word for the cross over between two different parts of the database.

using limited domain synthesizers

The goal of building such limited domain synthesizers is not just to show off good
synthesis. We followed this route as we see this as a very practical method for
building speech output systems.

For practical reasons, the default configuration includes the possibility of a back-up
voice that will be called to do synthesis if the limited domain synthesizer fails,
which for this default setup means the phrase includes an out of vocabulary word. It
would perhaps be more useful if the fall-back position just required synthesis of the
out of vocabulary word itself rather than the whole phrase, but that isn't as trivial
as it might seem. The limited domain synthesis does no prosody modification of the
selections, except for pitch smoothing at joins, thus slotting in one
diphone-synthesized word would sound very bad. At present each limited domain
synthesizer has an explicitly defined closest_voice. This voice is used when the
limited domain synthesis fails and also when generating the prompts, which can be
looked upon as the absolute minimal case, when the synthesizer has no data to
synthesize from.
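
The whole-phrase fall-back decision can be pictured with the following Python sketch.
It only illustrates the rule described above (any out of vocabulary word sends the
entire phrase to the back-up voice); the function name, the punctuation stripping and
the voice labels are hypothetical and do not reflect how festival implements it.

```python
def choose_voice(phrase, vocabulary, limited_voice, closest_voice):
    """Return limited_voice only if every word of phrase is in the
    recorded vocabulary; otherwise return closest_voice, which then
    synthesizes the whole phrase."""
    words = [w.strip(".,?!:;\"'").lower() for w in phrase.split()]
    words = [w for w in words if w]
    if words and all(w in vocabulary for w in words):
        return limited_voice
    return closest_voice


if __name__ == "__main__":
    # Hypothetical vocabulary and voice labels for a talking clock.
    vocab = {"the", "time", "is", "now", "one", "two"}
    for text in ["The time is now one.", "The time is now seventeen."]:
        print(text, "->", choose_voice(text, vocab, "ldom", "diphone"))
```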

There are also issues of speed here, which we are still trying to improve. This
technique should in fact be fast but it is still slower than our diphone synthesizer. One
significant reason is the cost of finding the optimal join point in selected units. Also
this synthesizer technique requires more memory than diphones as the cepstrum
parameters for the whole database are required at run time, in addition to the full
waveforms. These issues we feel can and should be addressed, as these techniques are
not fundamentally computationally expensive, so we intend to work on these aspects in
later releases.



Telling the time 



Festival includes a very simple little script that speaks the current time
(festival/examples/saytime). This section explains how to replace the
synthesizer used by this script with one that talks with your own voice. This is an
extreme example of a limited domain synthesizer but it is a good example as it
allows us to give a walkthrough of the stages involved in building a limited domain
synthesizer. This example is also small enough that it can be done in well under an
hour.

Following through this example will give a reasonable understanding of the relative
importance of the many important steps in the voice building process.

The following tasks are required:

• Designing the prompts
• Customizing the synthesizer front end
• Recording the prompts
• Autolabeling the prompts
• Building utterance structures for recorded utterances
• Extracting pitchmarks and building LPC coefficients
• Building a clunit based synthesizer from the utterances
• Testing and tuning

Before starting set the environment variables FESTVOXDIR and ESTDIR to the
directories which contain the festvox distribution and the Edinburgh Speech Tools
respectively. Under bash and other good shells this may be done by commands like

export FESTVOXDIR=/home/awb/projects/festvox
export ESTDIR=/home/awb/projects/1.4.3/speech
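
Since every later script relies on these two variables, it can save time to confirm that they are set and point at real directories before going further. The following is only a minimal sketch: the helper name check_dir_var is our own invention for illustration and is not part of the festvox distribution.

```shell
#!/bin/sh
# check_dir_var NAME: succeed only if the environment variable NAME
# is set and names an existing directory.
# (Hypothetical helper for illustration, not part of festvox.)
check_dir_var() {
    # Look up the value of the variable whose name is in $1.
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "$1 is not set" >&2
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$1=$value is not a directory" >&2
        return 1
    fi
    echo "$1=$value"
}

# Report on both variables used by the build scripts.
check_dir_var FESTVOXDIR || echo "fix FESTVOXDIR before continuing" >&2
check_dir_var ESTDIR || echo "fix ESTDIR before continuing" >&2
```

This only checks that the directories exist; it does not check that they actually contain a working festvox or Speech Tools installation.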



In earlier releases we only offered a command line based method for building voices
and limited domain synthesizers. In order to make the process easier and less prone
to error we have introduced a graphical front end to these scripts. This front end
is called pointyclicky (as it offers a pointy-clicky interface). It is particularly useful
in the actual prompting and recording. Although pointyclicky is the recommended
route, in this section we go through the process step by step to give a better
understanding of what is required and where problems may lie that require attention.



A simple script is provided for setting up the basic directory structure and copying
in some default parameter files. The festvox distribution includes all the setup for
the time domain. When building for your own domain, you will need to provide the file
etc/DOMAIN.data containing your prompts (as described below).

mkdir ~/data/time
cd ~/data/time
$FESTVOXDIR/src/ldom/setup_ldom cmu time awb

As in the definition of diphone databases we require three identifiers for the voice.
These are (loosely) institution, domain and speaker. Use net if you feel there isn't
an appropriate institution for you, though we have also used the project name that
the voice is being built for here. The domain name seems well defined. For speaker
name we have also used style as opposed to speaker name. The primary reason for
these is so that people do not all build limited domain synthesizers with the same
name, thus making it impossible to load them into the same instance of festival.

This setup script makes the directories and copies basic scheme files into the
festvox/ directory. You may need to edit these files later.

Designing the prompts 

In this saytime example the basic format of the utterance is

The time is now, EXACTNESS MINUTE INFO, in the DAYPART.

For example

The time is now, a little after five to ten, in the morning.

In all there are 1152 (4x12x12x2) utterances (although there are three possible day
info parts (morning, afternoon and evening) they only get 12 hours, 6 hours and 6
hours respectively). Although it would technically be possible to record all of these
we wish to reduce the amount of recording to a minimum. Thus what we actually do
is ensure there is at least one example of each value in each slot.
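
The arithmetic behind this choice can be checked with a small sketch. It assumes only the four slot sizes quoted above (4, 12, 12 and 2); the function names are our own and purely illustrative. The first function gives the number of distinct utterances, the second the smallest number of prompts that could possibly show every value of every slot at least once.

```shell
#!/bin/sh
# slot_product SIZE...: number of distinct utterances, i.e. the product
# of the number of values each slot can take.
slot_product() {
    total=1
    for n in "$@"; do
        total=$((total * n))
    done
    echo "$total"
}

# largest_slot SIZE...: the fewest prompts that can still show every
# value of every slot once, i.e. the size of the largest slot.
largest_slot() {
    max=0
    for n in "$@"; do
        if [ "$n" -gt "$max" ]; then max=$n; fi
    done
    echo "$max"
}

# Slot sizes as given in the text: 4 x 12 x 12 x 2.
slot_product 4 12 12 2     # prints 1152
largest_slot 4 12 12 2     # prints 12
```

So a dozen carefully chosen prompts is the theoretical floor, against 1152 for exhaustive recording; recording somewhat more than the floor gives the unit selection more than one example to choose from.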

Here is a list of 24 utterances that should cover the main variations.

The time is now, ... one, in the morning
...
The time is now, ... in the afternoon
...
The time is now, a little after half past six, in the evening
The time is now, exactly twenty-five past seven, in the evening
The time is now, almost twenty past eight, in the evening
The time is now, just after quarter past nine, in the evening
The time is now, almost ten past ten, in the evening
The time is now, exactly five past eleven, in the evening
The time is now, a little after quarter to midnight.

These examples are first put in the prompt file with an utterance number and the
prompt in double quotes like this.

(time0001 "The time is now ...")
(time0002 "The time is now ...")
(time0003 "The time is now ...")

These prompts should be put into etc/DOMAIN.data. This file is used by many of the
following sub-processes.
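
If the prompts start life as a plain list of sentences, one per line, the numbering can be added mechanically. The following is a sketch under that assumption; make_prompt_file is a hypothetical helper of our own, not a festvox script, and it assumes the sentences themselves contain no double quotes.

```shell
#!/bin/sh
# make_prompt_file PREFIX: read one sentence per line on standard input
# and print one numbered prompt entry per line, for example
#   (time0001 "The time is now ...")
# (Hypothetical helper for illustration, not part of festvox.)
make_prompt_file() {
    awk -v prefix="$1" '
        NF > 0 {                      # skip blank lines
            n++
            printf "(%s%04d \"%s\")\n", prefix, n, $0
        }'
}

# Example: number two sentences for the "time" domain.
printf '%s\n' \
    'The time is now, exactly five past eleven, in the evening.' \
    'The time is now, a little after quarter to midnight.' |
    make_prompt_file time
```

Redirecting the output to etc/time.data would give a file in the format shown above.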






Chapter 5. Limited domain synthesis 



Recording the prompts 

The best way to record the prompts is to use a professional speaker in a professional 
recording studio (anechoic chamber) using dual channel (one for audio and the other 
for the electroglottograph signal) direct to digital media using a high quality head 
mounted microphone. 

However most of us don't have such equipment (or voice talent) so readily available,
so whatever you do will probably have to be a compromise. The head mounted
mike requirement is the cheapest to meet and it is pretty important so you should
at least meet that requirement. Anechoic chambers are expensive, and even professional
recording studios aren't easy to access (though most Universities will have
some such facilities). It is possible to do away with the EGG reading if a little care is
taken to ensure pitchmarks are properly extracted from the waveform signal alone.

We have been successful in recording with a standard PC using a standard soundblaster
type 16bit audio card though results do vary from machine to machine. Before
attempting this you should record a few examples on the PC to see how much noise
is being picked up by the mike. For example try the following

$ESTDIR/bin/na_record -f 16000 -time 5 -o test.wav -otype riff



This will record 5 seconds from the microphone in the machine you run the
command on. You should do this to test that the microphone is plugged in (and
switched on). Play back the recorded wave with na_play and perhaps play with
the mixer levels until you get the least background noise with the strongest spoken
signal. Now you should display the waveform to see (as well as hear) how much
noise is there.

$FESTVOXDIR/src/general/display_sg test.wav



This will display the waveform and its spectrogram. Noise will show up in the
silence (and other) parts.

There are a few ways to reduce noise. Ensure the microphone cable isn't wrapped around
other cables (especially power cables). Turning the computer 90 degrees may help
and repositioning things can help too. Moving the sound board to some other slot in
the machine can also help, as can getting a different microphone (even of the same
make).

There is a large advantage in recording straight to disk as it allows the recording to
go directly into the right files. Doing off-line recording (onto DAT) is better in reducing
noise but transferring it to disk and segmenting it is a long and tedious process.

Once you have checked your recording environment you can proceed with the build
process.

First generate the prompts with the command

festival -b festvox/build_ldom.scm '(build_prompts "etc/time.data")'

and prompt and record them with the command

bin/prompt_them etc/time.data

You may or may not find listening to the prompts before speaking useful. Simply
displaying them may be adequate for you (if so comment out the na_play line in
bin/prompt_them).






Autolabeling the prompts 

The recorded prompts can be labeled by aligning them against the synthesized
prompts. This is done by the command

bin/make_labs prompt-wav/*.wav 

If the utterances are long (> 10 seconds of speech) you may require lots of swap 
space to do this stage (this could be fixed). 

Once labeled you should check that they are labeled reasonably. The labeler typically
gets it pretty much correct, or very wrong, so a quick check can often save time later.
You can check the database using the command

emulabel etc/emu_lab

Once you are happy with the labeling you can construct the whole utterance
structure for the spoken utterances. This is done by combining the basic structure from the
synthesized prompts and the actual times from the automatically labeled ones. This
can be done with the command

festival -b festvox/build_ldom.scm '(build_utts "etc/time.data")'



Extracting pitchmarks and building LPC coefficients

Getting good pitchmarks is important to the quality of the synthesis; see
the Section called Extracting pitchmarks from waveforms in Chapter 4 for a more
detailed discussion on extracting pitchmarks from waveforms. For the limited
domain synthesizers the pitch extraction is a little less crucial than for diphone
collection, though spending a little time on this does help.

If you have recorded EGG signals then you can use bin/make_pm from the .lar files.
Note that you may need to add (or remove) the option -inv depending on the
updownness of your EGG signal. However so far only the CSTR laryngograph seems to
produce inverted signals so the default should be adequate. Also note the parameters
that specify the pitch period range, -min and -max. The default settings are suitable for
a male speaker; for a female speaker you should modify them to something like

-min 0.0033 -max 0.0083 -def 0.005

This changes from a (male) range of 200Hz-80Hz with a default of 100Hz, to a
female range of 300Hz-120Hz with a default of 200Hz.
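
The option values are pitch periods in seconds, that is, one over the frequency in Hz, so the shortest period (-min) comes from the highest frequency. A small sketch of the conversion follows; the helper names hz_to_period and pm_range are hypothetical, and only the option names -min, -max and -def come from the text above.

```shell
#!/bin/sh
# hz_to_period HZ: print the pitch period in seconds for a frequency
# in Hz, to four decimal places.
hz_to_period() {
    LC_ALL=C awk -v hz="$1" 'BEGIN { printf "%.4f\n", 1 / hz }'
}

# pm_range HIGH_HZ LOW_HZ DEFAULT_HZ: print the three range options.
# The highest frequency gives the shortest period, hence -min.
pm_range() {
    echo "-min $(hz_to_period "$1") -max $(hz_to_period "$2") -def $(hz_to_period "$3")"
}

pm_range 200 80 100     # male range:   -min 0.0050 -max 0.0125 -def 0.0100
pm_range 300 120 200    # female range: -min 0.0033 -max 0.0083 -def 0.0050
```

For a particular speaker it is probably better to measure the actual pitch range from a few recordings than to rely on these generic figures.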

If you don't have an EGG signal you must extract the pitch from the waveform itself.
This works, though it may require a little modification of parameters, and it is
computationally more expensive (and won't be as exact as from an EGG signal). There are two
methods. The first uses Entropic's epoch program, which works pretty well without
tuning parameters. The second uses the free Speech Tools program pitchmark. The
first is very computationally expensive and, as Entropic is no longer in existence, the
program is no longer available (though rumours circulate that it may appear again
for free). To use epoch use the program

bin/make_pm_epoch wav/*.wav

To use pitchmark use the command

bin/make_pm_wave wav/*.wav

As with the EGG extraction, pitchmark uses parameters to specify the range of the
pitch periods; you should modify the parameters to best match your speaker's range.
The other filter parameters can also make a difference to the success. Rather than try to
explain what changing the figures means (I admit I don't fully know), the best solution
is to explain what you need to obtain as a result.

Irrespective of how you extract the pitchmarks we have found that a post-processing 
stage that moves the pitchmarks to the nearest peak is worthwhile. You can achieve 
this by 



bin/make_pm_fix pm/*.pm

At this point you may find that your waveform file is upside down. Normally this
wouldn't matter but due to the basic signal processing techniques we use to find
the pitch periods, upside down signals confuse things. People tell me that it shouldn't
happen but some recording devices return an inverted signal. From the cases we've
seen the same device always returns the same form, so if one of your recordings
is upside down all of them probably are (though there are some published speech
databases, e.g. BU Radio data, where a random half are upside down).

In general the higher peaks should be positive rather than negative. If not you can
invert the signals with the command

for i in wav/*.wav
do
   ch_wave -scale -1.0 "$i" -o "$i"
done

If they are upside down, invert them and re-run the pitch marking. (If you do invert them
it is not necessary to re-run the segment labeling.)

Power normalization can help too. This can be done globally by the function

bin/simple_powernormalize wav/*.wav

This should be sufficient for full sentence examples. In the diphone collection we
take greater care in power normalization but that vowel based technique will be too
confused by the longer, more varied examples.

Once you have pitchmarks, next you need to generate the pitch synchronous MELCEP
parameterization of the speech used in building the cluster synthesizer.

bin/make_mcep wav/*.wav

Building a clunit based synthesizer from the utterances 

Building a full clunit synthesizer is probably a little bit of overkill but the technique
basically works. See Chapter 12 for a more detailed discussion of unit selection
techniques. The basic parameter file, festvox/time_build.scm, is reasonable as a start.

festival -b festvox/build_ldom.scm '(build_clunits "etc/time.data")'

If all goes well this should create a file
festival/clunits/cmu_time_awb.catalogue and a set of index
trees in festival/trees/cmu_time_awb_time.tree.
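
For reference, the build commands used in this walkthrough can be collected into one script. The sketch below assumes no EGG signal (so it uses make_pm_wave) and assumes it is run from the voice directory; the run wrapper and the DRY_RUN variable are our own additions and not part of festvox. With DRY_RUN set the commands are only printed, which is a cheap way to review the sequence. The manual checks (listening for noise, inspecting labels with emulabel, checking for inverted waveforms) still have to be done by hand between steps.

```shell
#!/bin/sh
# run CMD ARGS...: execute a command, stopping the script if it fails.
# With DRY_RUN set to a non-empty value the command is only printed.
# (The run helper and DRY_RUN are illustrative additions.)
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

# The commands below are the ones given in the text, in the same order.
build_time_voice() {
    run festival -b festvox/build_ldom.scm '(build_prompts "etc/time.data")'
    run bin/prompt_them etc/time.data
    run bin/make_labs prompt-wav/*.wav
    run festival -b festvox/build_ldom.scm '(build_utts "etc/time.data")'
    run bin/make_pm_wave wav/*.wav
    run bin/make_pm_fix pm/*.pm
    run bin/simple_powernormalize wav/*.wav
    run bin/make_mcep wav/*.wav
    run festival -b festvox/build_ldom.scm '(build_clunits "etc/time.data")'
}

# Print the sequence without running anything.
DRY_RUN=1
build_time_voice
```

Unsetting DRY_RUN runs the steps for real, although in practice it is usually wiser to run them one at a time and look at the results of each.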






Testing and tuning 

To test the new voice start Festival as 



festival festvox/cmu_time_awb_ldom.scm '(voice_cmu_time_awb_ldom)' 

The function (saytime) can now be called and it should say the current time, or 

(saythistime "11:23"). 

Note this synthesizer can only say the phrases that it has phones for, which basically
means it can only say the time in the format given at the start of this chapter. Thus
although you can use SayText it will only synthesize words that are in the domain.
That's what limited domain synthesis is.

A full directory structure of this example with the recordings and parameter
files is available at http://festvox.org/examples/cmu_time_awb_ldom/ [1].
An on-line demo of this voice in that directory is available at
http://festvox.org/examples/cmu_time_awb_ldom/ [2].



Making it better 

The above walkthrough is to give you a basic idea of the stages involved in building
a limited domain synthesizer. The quality of a limited domain synthesizer will most
likely be excellent in parts and very bad in others, which is typical of techniques like
this. Each stage is, of course, more complex than this and there are a number of things
that can be done to improve it.



For limited domain synthesizers it should be possible to correct the errors such that
it is excellent always. To do so though requires being able to diagnose where the
problems are. The most likely problems are listed here

• Mis-labeling Due to lipsmacks and other reasons the labeling may not be correct.
The result may be wrong, extra or missing segments in the synthesized utterance.
Using emulabel you can check and hand correct the labels.

• Mis-spoken data The speaker may have made a mistake in the content. This can
often happen even when the speaker is careful. Mistakes can be in the actual content
(it is easy to read a list of numbers wrongly), but hesitations and false starts
can also make the recording bad. Also note that inconsistent prosodic variation can
affect the synthesis quality. Re-recording can be considered for bad examples,
or you can delete them from the etc/LDOM.data list, assuming there is enough
variation in the rest of the examples to ensure proper coverage of the domain.

• Bad pitchmarking Automatic pitchmarking is not really automatic. It is very
worthwhile checking to see if it is correct and re-running the pitchmarking with
better parameters until it is better. (We need better documentation here on how to
know what "correct" is.)

• Looking at the data There is never a substitute for actually looking at the data. Use
emulabel to actually look at the recorded utterances and see what the labeling is.
Ensure these match and files haven't got out of order. Look at a random selection,
not just the first example.

• Improving the unit clustering The clustering techniques and the features used here
are pretty generic and by no means optimal. Even for the simple example given
here it is not very good. See Chapter 12 on unit selection for more discussion on
this. Adding new features for use in clustering may help a lot.

The line between limited domain synthesis and unit selection is fuzzy. The more com- 
plex and varied the phrases you synthesize are, the more difficult it is to produce 
reliable synthesis. 






Notes 

1. http://festvox.org/examples/cmu_time_awb_ldom/

2. http://festvox.org/ldom/index.html 







Chapter 6. Text analysis 



This chapter discusses some of the basic problems in analyzing text when trying to
convert it to speech. Although it is often considered a trivial problem, not worth
spending time on, anyone who has to actually listen to general text-to-speech
systems quickly realises it is not as easy to pronounce text as it first appears.
Numbers, symbols, acronyms and abbreviations appear to various degrees in different types
of text; even the most general types, like news stories and novels, still have tokens
which do not have a simple pronunciation that can be found merely by looking up
the token in a lexicon, or using letter to sound rules.

In any new language, or any new domain in which you wish to convert text to speech,
building an appropriate text analysis module is necessary. As an attempt to define
what we mean by text analysis more specifically we will consider this module as
taking in strings of characters and producing strings of words, where we define words
to be items for which a lexicon can provide pronunciations either by direct lookup or
by some form of letter to sound rules.

The degree of difficulty of this conversion task depends on the text type and language.
For example in languages like Chinese, Japanese etc., there is, conventionally, no
use of whitespace characters between words as found in most Western languages,
thus even identifying the token boundaries is an interesting task. Even in English
text the proportion of simply pronounceable words to what we will term non-standard
words can vary greatly. We define non-standard words (NSWs) to be those tokens
which do not appear directly in the lexicon (at least as a first simplification). Thus
tokens containing digits, abbreviations, and out of vocabulary words are all considered to
be NSWs that require some form of identification before their pronunciation can be
specified. Sometimes NSWs are ambiguous and some (often shallow) level of
analysis is necessary to identify them. For example in English the string of digits 1996 can
have several different pronunciations depending on its use. If it is used as a year it
is pronounced as nineteen ninety-six, if it is a quantity it is more likely
pronounced as one thousand nine hundred (and) ninety-six, while if it is used as a
telephone extension it can be pronounced simply as a string of digits one nine nine
six. Determining the appropriate type of expansion is the job of the text analysis
module.

Much of this chapter is based on a project that was carried out at a summer workshop
at Johns Hopkins University in 1999 [JHU-NSW-99] and later published in [Sproat00].
The tools and techniques developed at that workshop were further developed and
documented and are now distributed as part of the FestVox project. After a discussion
of the problem in more detail, concentrating on English examples, a full presentation
of the NSW text analysis technique will be given with a simple example. After that we
will address different approaches that can be taken in Festival to build general and
customized text analysis models. Then we will address a number of specific problems
that appear in text analysis in various languages including homograph
disambiguation, number pronunciation in Slavic languages, and segmentation in Chinese.

Non-standard words analysis 

In an attempt to avoid relying solely on a bunch of "hacky" rules, we can better define
the task of analyzing text using a number of statistically trained models using either
labeled or unlabeled text from the desired domain. At first approximation it may
seem to be a trivial problem, but the number of non-standard words is enough, even in
what is considered clean text such as press wire news articles, to make the synthesis
sound bad without it.

Full NSW model description and justification to be added; until then, see the following
(older) parts.




Token to word rules 

The basic model in Festival is that each token will be mapped to a list of words by a
call to a token_to_words function. This function will be called on each token and it
should return a list of words. It may check the token's context (within the current
utterance) too if necessary. The default action should (for most languages) simply
be returning the token itself as a list of one word (itself). For example your basic
function should look something like



(define (MYLANG_token_to_words token name)
  "(MYLANG_token_to_words TOKEN NAME)
Returns a list of words for the NAME from TOKEN.  This primarily
allows the treatment of numbers, money etc."
  (cond
   (t
    (list name))))



This function should be set in your voice selection function as the function for token
analysis

(set! token_to_words MYLANG_token_to_words)

This function should be added to in order to deal with all tokens that are not in your
lexicon, cannot be treated by your letter-to-sound rules, or are ambiguous in some way
and require context to resolve.

For example suppose we wish to simply treat all tokens consisting of strings of digits
to be pronounced as a string of digits (rather than numbers). We would add
something like the following

(set! MYLANG_digit_names
  '((0 "zero")
    (1 "one")
    (2 "two")
    (3 "three")
    (4 "four")
    (5 "five")
    (6 "six")
    (7 "seven")
    (8 "eight")
    (9 "nine")))

(define (MYLANG_token_to_words token name)
  "(MYLANG_token_to_words TOKEN NAME)
Returns a list of words for the NAME from TOKEN.  This primarily
allows the treatment of numbers, money etc."
  (cond
   ((string-matches name "[0-9]+") ;; any string of digits
    (mapcar
     (lambda (d)
       (car (cdr (assoc_string d MYLANG_digit_names))))
     (symbolexplode name)))
   (t
    (list name))))

But more elaborate rules are also necessary. Some tokens require context to
disambiguate and sometimes multiple tokens are really one object e.g. "$12 billion" must
be rendered as "twelve billion dollars", where the money name crosses over the second
word. Such multi-token rules must be split into multiple conditions, one for each part
of the combined token. Thus we need to identify that the "$DIGITS" is in a context
followed by ".*illion". The code below renders the full phrase for the dollar amount. The
second condition ensures nothing is returned for the ".*illion" word as it has already
been dealt with by the previous token.

   ((and (string-matches name "\\$[123456789]+")
         (string-matches (item.feat token "n.name") ".*illion.?"))
    (append
     (digits_to_cardinal (string-after name "$")) ;; amount
     (list (item.feat token "n.name"))            ;; magnitude
     (list "dollars")))                           ;; currency name
   ((and (string-matches name ".*illion.?")
         (string-matches (item.feat token "p.name") "\\$[123456789]+"))
    ;; dealt with in previous token
    nil)

Note this still is not enough as there may be other types of currency (pounds, yen,
francs etc.), some of which may be mass nouns and require no plural (e.g. "yen") and
some of which may be count nouns that require plurals. Also this only deals with whole
numbers of .*illions; "$2.25 million" is common too. See the full example (for English)
in festival/lib/token.scm.

A large list of rules is typically required. They should be looked upon as breaking
down the problem into smaller parts, potentially recursively. For example hyphenated
tokens can be split into two words. It is probably wise to explicitly deal with all
tokens that are not purely alphabetic, maybe having a catch-all that spells out all
tokens that are not explicitly dealt with (e.g. the numbers). For example you could
add the following as the penultimate condition in your token_to_words function

((not (string-matches name "[A-Za-z]+")) ;; not purely alphabetic
 (symbolexplode name))

Note this isn't necessarily correct when certain letters may be homographs. For
example the token "a" may be a determiner or a letter of the alphabet. When it is a
determiner it may (often) be reduced, while as a letter it probably isn't (i.e. pronounced
as "@" or "ei"). Other languages also exhibit this problem (e.g. Spanish "y").
Therefore when we call symbolexplode we don't want just the letter but to also specify
that it is the letter pronunciation we want and not any other form. To ensure
the lexicon system gets the right pronunciation we therefore wish to specify the part of
speech with the letter. Actually rather than just a string of atomic words being
returned by the token_to_words function the words may be descriptions including
features. Thus for example we don't just want to return

(a b c)

We want to be more specific and return

(((name a) (pos nn))
 ((name b) (pos nn))
 ((name c) (pos nn)))

This can be done by the code

((not (string-matches name "[A-Za-z]+")) ;; not purely alphabetic
 (mapcar
  (lambda (l)
    ;; build a word description with name and part of speech features
    (list (list 'name l) (list 'pos 'nn)))
  (symbolexplode name)))

The above assumes that all single character symbols (letters, digits, punctuation
and other "funny" characters) have an entry in your lexicon with a part of speech field
nn, with a pronunciation of the character in isolation.






The list of tokens that you may wish to write/train rules for is of course language
dependent and to a certain extent domain dependent. For example there are many
more numbers in email text than in narrative novels. The number of abbreviations is
also much higher in email and news stories than in more normal text. It may be worth
having a look at some typical data to find out the distribution and find out what is
worth working on. For a rough guide the following is a list of the symbol types we
currently deal with in English, many of which will require some treatment in other
languages.



Money

Money amounts often have different treatment than simple numbers and
conventions about the sub-currency part (i.e. cents, pfennigs etc). Remember that
it is not just numbers in the local currency you have to deal with; currency
values from different countries are common in lots of different texts (e.g. dollars,
yen, DMs and euro).

Numbers

Strings of digits will of course need mapping even if there is only one mapping
for a language (rare). Consider at least telephone numbers versus amounts; most
languages make a distinction here. In English we need to distinguish further, see
below for the more detailed discussion.

number/number

This can be used as a date, fraction or alternate. Context will help, though
techniques of dropping back to saying the string of characters often preserve the
ambiguity, which can be better than forcing a decision.

acronyms

List of upper case letters (with or without vowels). The decision to pronounce
as a word or as letters is difficult in general but good guesses go far. If it is short
(< 4 characters), not in your lexicon and not surrounded by other words in upper case,
it is probably an acronym; further analysis of vowels, consonant clusters etc will
help.

number-number

Could be a range, or score (football), dates etc.

word-word

Usually a simple split on each part is sufficient, but not when used as a dash.

word/word

As an alternative, or a Unix pathname.

's or TOKENs

An appended "s" to a non alphabetic token is probably some form of pluralization;
removing it and recursing on the analysis is a reasonable thing to try.

times and dates

These exist in various standardized forms, many of which are easy to recognize
and break down.

telephone numbers

These vary from country to country (and by various conventions) but there
may be standard forms that can be recognized.

roman numerals

Sometimes these are pronounced as cardinal numbers ("chapter II"), or as ordinals
("James II").

ascii art

If you are dealing with on-line text there are often extra characters in a document
that should be ignored, or at least not pronounced literally, e.g. lines of hyphens
used as separators.

email addresses, URLs, file names

Depending on your context this may be worth spending time on.

tokens containing any other non-alphanumeric character

Splitting the token around the non-alphanumeric and recursing on each part
before and after it may be reasonable.
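
When looking at typical data to see which of these types actually occur, a crude first-pass classifier run over a sample of tokens can be enough to get rough counts. The following sketch is written in shell, outside Festival, purely for that kind of survey; it is not how Festival does text analysis, where such rules belong in token_to_words. The class names, the patterns and their order are all our own assumptions and cover only a few of the types listed above.

```shell
#!/bin/sh
# matches STRING REGEX: true if the whole STRING matches the extended
# regular expression REGEX.
matches() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq "^($2)\$"
}

# classify_token TOKEN: print a rough class name for one token.
# (Illustrative patterns only; earlier tests take priority.)
classify_token() {
    if   matches "$1" '\$[0-9]+(\.[0-9]+)?'; then echo money
    elif matches "$1" '[0-9]+';              then echo digits
    elif matches "$1" '[0-9]+/[0-9]+';       then echo number/number
    elif matches "$1" '[0-9]+-[0-9]+';       then echo number-number
    elif matches "$1" '[A-Z]{2,3}';          then echo acronym-candidate
    elif matches "$1" '[A-Za-z]+-[A-Za-z]+'; then echo word-word
    elif matches "$1" '[A-Za-z]+/[A-Za-z]+'; then echo word/word
    elif matches "$1" '[A-Za-z]+';           then echo word
    else                                          echo other
    fi
}

# Classify a few sample tokens, one per line.
for tok in '$12' 1996 3/4 1939-45 CMU well-known and/or hello 'a@b.c'; do
    printf '%s\t%s\n' "$tok" "$(classify_token "$tok")"
done
```

Piping the class names for a whole corpus through sort and uniq -c gives a quick picture of which token types are worth writing rules for first.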

Remember the first purpose of text analysis is to ensure you can deal with anything,
even if it is just saying the word "unknown" (in the appropriate language). Also it is
probably not worth spending time on rare token forms, though remember it is not
easy to judge what is rare and what is not.

Number pronunciation 

Almost everyone will expect a synthesizer to be able to speak numbers. As it is not
feasible to list all possible digit strings in your lexicon, you will need to provide a
function that returns a string of words for a given string of digits.

In its simplest form you should provide a function that decodes the string of digits.
The example spanish_number (and spanish_number_from_digits) in the released
Spanish voice (festvox_ellpc11k.tar.gz) is a good general example.

Multi-token numbers

A number of languages use spaces within numbers where English might use commas.
For example German, Polish and other texts may contain

64 000

to denote sixty four thousand. As this will be multiple tokens in Festival's basic
analysis it is necessary to write multiple conditions in your token_to_words function.

Declensions 

In many languages, the pronunciation of a number depends on the thing that is being
counted. For example the digit '1' in Spanish has multiple pronunciations depending
on whether it is referring to a masculine or feminine object. In some languages this
becomes much more complex where there are a number of possible declensions. In
our Polish synthesizer we solved this by adding an extra argument to the number
generation function which then selected the actual number word (typically the final word
in a number) based on the desired declension.

Example to be added






Homograph disambiguation 

Discussion to be added


TTS modes

Discussion to be added

Mark-up modes 

In some situations it is possible for the user of a text-to-speech system to provide
more information for the synthesizer than just the text, or the type of text. It is
near impossible for TTS engines to get everything right all of the time, so in such
situations it is useful to offer the developer a method to help guide the synthesizer
in its synthesis process.

Most speech synthesizers offer some special method or embedded commands, but
these are specific to one interface or one API. For example the Microsoft SAPI
interface allows various commands to be embedded in a text string.

However there has been a move more recently to offer a more general mark-up
method. A number of people saw the potential use of XML as a general method for
marking up text for speech synthesis. The earliest method we know of was in a
Masters thesis at Edinburgh in 1995 [isard]. This was later published under the
name SSML. A number of other groups were also looking at this and a large
consortium formed to define this further under various names, STML, and eventually
Sable.

Around the same time, more serious definitions of such a mark-up were being
developed. The first to reach a well-defined stage was JSML (Java Speech Mark-up
Language), which covered aspects of speech recognition and grammars as well as
speech synthesis mark-up. Unlike any of the other XML based markup languages, JSML,
as it was embedded within Java, could define exceptions in a reasonable way. One of
the problems with a simple XML markup is that it is one way. You can request a voice
or a language or some functionality, but there is no mechanism for feedback to know
if such a feature is actually available.

XML markup for speech has been further advanced with VoiceXML, which defines a
mark-up language for basic dialog systems. The speech synthesis part of VoiceXML
closely follows the functionality of JSML and its predecessors.

A new standard for markup for speech synthesis is currently being defined by W3C
under the name SSML, confusingly the same name as the earliest example. It is not
designed to be compatible with the original, but takes into account the functionality
and desires of users of TTS. SSML markup is also defined as the method for speech
synthesis markup in Microsoft's SALT tags.

Discussion to be added







Chapter 7. Lexicons 



This chapter covers methods for finding the pronunciation of a word. This is either
by a lexicon (a large list of words and their pronunciations) or by some method of
letter to sound rules.



Word pronunciations 

A pronunciation in Festival requires not just a list of phones but also a syllabic
structure. In some languages the syllabic structure is very simple and well defined
and can be unambiguously derived from a phone string. In English however this may
not always be the case (compound nouns being the difficult case).

The lexicon structure that is basically available in Festival takes both a word and
a part of speech (an arbitrary token) to find the given pronunciation. For English
this is probably the optimal form: although there exist homographs in the language,
the word itself and a fairly broad part of speech tag will mostly identify the proper
pronunciation.

An example entry is

("photography"
 n
 (((f @) 0) ((t o g) 1) ((r @ f) 0) ((ii) 0)))

Note that in addition to explicit marking of syllables a stress value is also given
(0 or 1). In some languages lexical stress is fully predictable, in others highly
irregular. In some this field may be more appropriately used for another purpose,
e.g. tone type in Chinese.

There may be other languages which require a more complex (or less complex) format,
and the decision to use some other format rather than this one is up to you.

Currently there is only residual support for morphological analysis in Festival. A
finite state transducer based analyzer for English based on the work in [ritchie92]
is included in festival/lib/engmorph.scm and festival/lib/engmorphsyn.scm, but
this should be considered experimental at best. Given the lack of such an analyzer
our lexicons need to list not only base forms of words but also all their
morphological variants. This is (more or less) acceptable in languages such as
English or French, but for languages with richer morphology such as German it may
seem an unnecessary requirement. For agglutinative languages such as Finnish and
Turkish this appears to be even more of a restriction. This is probably true, but
the current restriction is not necessarily hopeless. We have successfully built very
good letter-to-sound rules for German, a language with a rich morphology, which
allows the system to properly predict pronunciations of morphological variants of
root words it has not seen before. We have not yet done any experiments with Finnish
or Turkish but expect this technique would work (though of course developing a
proper morphological analyzer would be better).

Lexicons and addenda

The basic assumption in Festival is that you will have a large lexicon, tens of
thousands of entries, that is used as a standard part of an implementation of a
voice. Letter-to-sound rules are used as back up when a word is not explicitly
listed. This view is based on how English is best dealt with. However this is a very
flexible view. An explicit lexicon isn't necessary in Festival and it may be possible
to do much of the work in letter-to-sound rules. This is how we have implemented
Spanish. However even when there is a strong relationship between the letters in a
word and their pronunciation we still find a lexicon useful. For Spanish we still use
the lexicon for symbols such as "$", "%", individual letters, as well as irregular
pronunciations.




In addition to a large lexicon Festival also supports a smaller list called an
addenda. This is primarily provided to allow specific applications and users to add
entries that aren't in the existing lexicon.
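The relationship between the addenda, the main lexicon and the letter-to-sound rules
can be pictured as a chain of fallbacks. The Python sketch below is only a model of
that idea, with hypothetical names (lookup, and dictionaries mapping a word to its
pronunciations by part of speech); it is not Festival's lookup code.

```python
def lookup(word, pos, addenda, lexicon, lts):
    """Find a pronunciation: addenda first, then lexicon, then LTS rules."""
    for table in (addenda, lexicon):
        entries = table.get(word)
        if entries:
            # Prefer the requested part of speech, else take the first entry listed.
            return entries.get(pos, next(iter(entries.values())))
    # Not listed anywhere: fall back to the letter-to-sound function.
    return lts(word)
```

Consulting the addenda first lets a user entry override the main lexicon.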



Out of vocabulary words

Because it is impossible to list all words in a natural language, for general
text-to-speech you will need to provide something to pronounce out of vocabulary
words. In some languages this is easy but in others it is very hard. No matter what
you do you must provide something, even if it is simply replacing the unknown word
with the word "unknown" (or its local language equivalent). By default a lexicon in
Festival will throw an error if a requested word isn't found. To change this you can
set the lts_method. Most usefully you can reset this to the name of a function, which
takes a word and a part of speech specification and returns a word pronunciation as
described above.

For example, if we are always going to return the word unknown but print a warning
that the word is being ignored, a suitable function is

(define (mylex_lts_function word feats)
  "Deal with out of vocabulary word."
  (format t "unknown word: %s\n" word)
  '("unknown" n (((uh n) 1) ((n ou n) 1))))

Note the pronunciation of "unknown" must be in the appropriate phone set. Also
the syllabic structure is required. You need to specify this function for your
lexicon as follows

(lex.set.lts.method 'mylex_lts_function)



At one level above merely identifying out of vocabulary words, they can be spelled.
This of course isn't ideal but it will allow the basic information to be passed over
to the listener. This can be done with the out of vocabulary function, as follows.

(define (mylex_lts_function word feats)
  "Deal with out of vocabulary words by spelling out the letters in the
word."
  (if (equal? 1 (length word))
      (begin
        ;; a single character that is not in the lexicon: stop recursing
        (format t "the character %s is missing from the lexicon\n" word)
        '("unknown" n (((uh n) 1) ((n ou n) 1))))
      ;; build an entry of the form (word pos syllables)
      (list
       word
       'n
       (apply
        append
        (mapcar
         (lambda (letter)
           ;; third element of a lexical entry is its syllable list
           (car (cdr (cdr (lex.lookup letter 'n)))))
         (symbolexplode word))))))

A few points are worth noting in this function. It recursively calls the lexical
lookup function on the characters in a word. Each letter should appear in the
lexicon with its pronunciation (in isolation), but a check is made to ensure we
don't recurse for ever. The symbolexplode function assumes that letters are single
bytes, which may not be true for some languages, and that function would need to be
replaced for such a language. Note that we append the syllables of each of the
letters in the word. For long words this might be too naive as there could be
internal prosodic structure in such a spelling that this method would not allow for.
In that case you would want letters to be words, thus the symbol explosion should
happen at the token to word level. Also the above function assumes that the part of
speech for letters is n. This is only really important where letters are homographs
in a language, so this can be used to distinguish which pronunciation you require
(cf. "a" in English or "y" in French).
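The same spelling strategy can be sketched in Python, where iterating over a string
yields whole characters rather than bytes. This is an illustration under
assumptions: lts_spell and lookup_letter are hypothetical names, and entries are
modelled as (word, part of speech, syllables) with each syllable a (phones, stress)
pair.

```python
UNKNOWN_SYLLABLES = [(["uh", "n"], 1), (["n", "ou", "n"], 1)]

def lts_spell(word, lookup_letter):
    """Build an entry for an unknown word by spelling it letter by letter."""
    syllables = []
    for letter in word:
        pron = lookup_letter(letter)  # syllables of the letter said in isolation
        # A letter missing from the lexicon is replaced by the word "unknown".
        syllables.extend(pron if pron is not None else UNKNOWN_SYLLABLES)
    return (word, "n", syllables)
```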



Building letter-to-sound rules by hand 

For many languages there is a systematic relationship between the written form of
a word and its pronunciation. For some languages this can be fairly easy to write
down by hand. In Festival there is a letter to sound rule system that allows rules
to be written, but we also provide a method for building rule sets automatically
which will often be more useful. The choice of using hand-written or automatically
trained rules depends on the language you are dealing with and the relationship it
has between its orthography and its phone set.



For well defined languages like Spanish and Croatian writing rules by hand can
be simpler than training. Training requires an existing set of lexical entries to
train from and that may be your decision criterion. Hand written letter to sound
rules are context dependent re-write rules which are applied in sequence, mapping
strings of letters to strings of phones (though the system does not explicitly care
what the types of the strings actually will be used for).



The basic form of the rules is 

( LC [ alpha ] RC => beta )

This is interpreted as: alpha, a string of one or more symbols on the input tape,
is written to beta, a string of zero or more symbols on the output tape, when in the
presence of LC, a left context of zero or more input symbols, and RC, a right context
of zero or more input symbols. Note the input tape and the output tape are different.
Although the input and output alphabets need not be distinct, the left hand side of a
rule can only refer to the input tape and never to anything that has been produced by
a right hand side. Thus rules within a ruleset cannot "feed" or "bleed" themselves.
It is possible to cascade multiple rule sets, but we will discuss that below.

For example, to deal with the pronunciation of the letters "ch" word initially in
English we may write two rules like this

( # [ c h ] r => k )

( # [ c h ] => ch )

These deal with words like "christmas" and "chair". Note the # symbol is special and
is used to denote a word boundary. LTS rules may refer to a word boundary but cannot
refer to previous or following words; you would need to catch this with some form
of post-lexical rule (see the section called Post-lexical rules) where the word is
within some context. In the above rules we are mapping two letters, c and h, to a
single phone, k or ch. Also note the order of these rules. The first rule is more
specific than the second, so it should appear first to deal with the specific case.
If the order were reversed, k could never apply as the ch rule would cover that case
too.

Thus LTS rules should be written with the most specific cases first and typically end
in a default case. There should be a default case for all individual letters in the
language's alphabet without any context restrictions mapping to some default phone.
Therefore following the above rules there would be other c rules with various
contexts but the final one should probably be

( [ c ] => k ) 
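To make the rule semantics concrete, here is a small interpreter sketch in Python.
It is not Festival's rule engine: the representation of a rule as a (left context,
letters, right context, phones) tuple, the name apply_rules and the identity
default rules for the other letters are all assumptions made for the illustration.

```python
RULES = [
    (["#"], ["c", "h"], ["r"], ["k"]),   # ( # [ c h ] r => k )
    (["#"], ["c", "h"], [], ["ch"]),     # ( # [ c h ] => ch )
    ([], ["c"], [], ["k"]),              # ( [ c ] => k )
] + [([], [x], [], [x]) for x in "ahimrst"]  # placeholder defaults

def apply_rules(word, rules):
    """Rewrite the letters of word to phones using the first matching rule."""
    tape = ["#"] + list(word) + ["#"]   # input tape with word boundaries
    out = []                            # separate output tape
    i = 1
    while i < len(tape) - 1:
        for lc, target, rc, phones in rules:
            end = i + len(target)
            if (tape[i:end] == target
                    and i - len(lc) >= 0 and tape[i - len(lc):i] == lc
                    and tape[end:end + len(rc)] == rc):
                out.extend(phones)
                i = end
                break
        else:
            raise ValueError("no rule for letter %r" % tape[i])
    return out
```

Contexts are matched only against the input tape, so a rule cannot see what earlier
rules produced, and the ValueError shows what happens when a letter lacks a default
rule.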






As it is a common error in writing these rules, it is worth repeating: if a rule set
is to be universally applicable, all letters in the input alphabet must have at least
one rule mapping them to some phone.

The section to be mapped (within square brackets) and the section it is mapped into 
(after the "=>") must be items in the input and output alphabets and may not include 
sets or regular expression operators. This does mean more rules need to be explicitly 
written than you might like, but that will also help you not forget some rules that are 
required. 

For some languages it is convenient to write a number of rule sets. For example, one
to map the input to lower case, and maybe deal with alternate treatments of accented
characters, e.g. re-write the ASCII "e'" as é. Also we have used rule sets to post
process the generated phone string to add stress and syllable breaks.

Finally some people have stressed that writing good letter to sound rules is hard. We
would disagree with this: from our experience writing good letter to sound rules by
hand is very hard, requiring skilled and very laborious work. For anything but the
simplest of languages writing rules by hand requires much more time than people
typically have, and the rules will still contain errors (even with an exception
list). However hand written rule sets may be ideal in some circumstances.

Building letter-to-sound rules automatically



For some languages the writing of a rule system is too difficult. Although there have
been many valiant attempts to do so for languages like English, life is basically too
short to do this. Therefore we also include a method for automatically building LTS
rule sets from a lexicon of pronunciations. This technique has successfully been used
for English (British and American), French and German. The difficulty and
appropriateness of using letter-to-sound rules is very language dependent.



The following outlines the processes involved in building a letter to sound model
for a language given a large lexicon of pronunciations. This technique is likely to
work for most European languages (including Russian) but doesn't seem particularly
suitable for very large alphabet languages like Japanese and Chinese. The process
described here is not (yet) fully automatic but the hand intervention required is
small and may easily be done even by people with only a very little knowledge of the
language being dealt with.

The process involves the following steps

• Pre-processing the lexicon into a suitable training set.

• Defining the set of allowable pairings of letters to phones. (We intend to do this
fully automatically in future versions.)

• Constructing the probabilities of each letter/phone pair.

• Aligning letters to an equal set of phones/_epsilons_.

• Extracting the data by letter suitable for training.

• Building CART models for predicting phone from letters (and context).

• Building an additional lexical stress assignment model (if necessary).

All except the first two stages of this are fully automatic.

Before building a model it is wise to think a little about what you want it to do.
Ideally the model is an auxiliary to the lexicon so only words not found in the
lexicon will require use of the letter-to-sound rules. Thus only unusual forms are
likely to require the rules. More precisely the most common words, often having the
most non-standard pronunciations, should probably be explicitly listed always. It is
possible to reduce the size of the lexicon (sometimes drastically) by removing all
entries that the trained LTS model correctly predicts.

Before starting it is wise to consider removing some entries from the lexicon before
training. I typically will remove words under 4 letters and, if part of speech
information is available, I remove all function words, ideally only training from
nouns, verbs and adjectives as these are the most likely forms to be unknown in text.
It is useful to have morphologically inflected and derived forms in the training set
as it is often such variant forms that are not found in the lexicon even though their
root morpheme is. Note that in many forms of text, proper names are the most common
form of unknown word and even the technique presented here may not adequately cater
for that form of unknown word (especially if the unknown words are non-native
names). This is all stating that this may or may not be appropriate for your task,
but the rules generated by this learning process have, in the examples we've done,
been much better than what we could produce by hand writing rules of the form
described in the previous section.

First preprocess the lexicon into a file of lexical entries to be used for training,
removing function words and changing the head words to all lower case (this may be
language dependent). The entries should be of the form used for input for Festival's
lexicon compilation. Specifically the pronunciations should be simple lists of phones
(no syllabification). Depending on the language, you may wish to remove the
stressing; for the examples here we have, though later tests suggest that we should
keep it in even for English. Thus the training set should look something like

("table" nil (t ei b l))
("suspicious" nil (s @ s p i sh @ s))
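The pre-processing step might be sketched in Python as below. This is an
illustration, not a FestVox script: the function names, the part of speech tags in
FUNCTION_POS and the in-memory entry format (word, part of speech, list of
(phones, stress) syllables) are assumptions.

```python
FUNCTION_POS = frozenset({"det", "prep", "conj", "pron", "aux"})  # hypothetical tags

def make_training_entries(lexicon, min_letters=4):
    """Drop short and function words, lower-case heads, flatten syllables."""
    entries = []
    for word, pos, syllables in lexicon:
        if len(word) < min_letters or pos in FUNCTION_POS:
            continue
        phones = [p for syl_phones, _stress in syllables for p in syl_phones]
        entries.append((word.lower(), phones))
    return entries

def format_entry(entry):
    """Write one entry in the ("word" nil (phones)) form shown above."""
    word, phones = entry
    return '("%s" nil (%s))' % (word, " ".join(phones))
```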

It is best to split the data into a training set and a test set if you wish to know
how well your training has worked. In our tests we remove every tenth entry and put
it in a test set. Note this will mean our test results are probably better than if we
removed, say, the last ten in every hundred.
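A split of this kind is easy to sketch in Python. Which of each ten entries goes to
the test set is an assumption here (the tenth); the name split_every_tenth is
hypothetical.

```python
def split_every_tenth(entries):
    """Return (train, test) with every tenth entry held out for testing."""
    test = entries[9::10]
    train = [e for i, e in enumerate(entries) if i % 10 != 9]
    return train, test
```

Because neighbouring entries in an alphabetically sorted lexicon are often related
forms, the held-out words tend to resemble the training words, which is why the
scores are optimistic.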

The second stage is to define the set of allowable letter to phone mappings
irrespective of context. This can sometimes be initially done by hand then checked
against the training set. Initially construct a file of the form

(require 'lts_build)
(set! allowables
      '((a _epsilon_)
        (b _epsilon_)
        (c _epsilon_)
        ...
        (y _epsilon_)
        (z _epsilon_)
        (# #)))

All letters that appear in the alphabet should (at least) map to _epsilon_, including
any accented characters that appear in that language. Note the last two hashes. These
are used to denote beginning and end of word and are automatically added during
training; they must appear in the list and should only map to themselves.

To incrementally add to this allowable list run festival as

festival allowables.scm

and at the prompt type

festival> (cummulate-pairs "oald.train")






with your train file. This will print out each lexical entry that couldn't be aligned 
with the current set of allowables. At the start this will be every entry. Looking at 
these entries add to the allowables to make alignment work. For example if the fol- 
lowing word fails 



("abate" nil (ah b ey t)) 

Add ah to the allowables for letter a, b to b, ey to a and t to letter t. After doing
that restart festival and call cummulate-pairs again. Incrementally add to the
allowable pairs until the number of failures becomes acceptable. Often there are
entries for which there is no real relationship between the letters and the
pronunciation, such as in abbreviations and foreign words (e.g. "aaa" as
"t r ih p ax l ey"). For the lexicons I've used the technique on, less than 10 per
thousand fail in this way.

It is worth while being consistent on defining your set of allowables. (At least) two
mappings are possible for the letter sequence ch: having letter c go to phone ch
and letter h go to _epsilon_, and also letter c go to phone _epsilon_ and letter h
go to ch. However only one should be allowed; we preferred c to ch.

It may also be the case that some letters give rise to more than one phone. For
example the letter x in English is often pronounced as the phone combination k and s.
To allow this, use the multiphone k-s. Thus the multiphone k-s will be predicted for
x in some context and the model will separate it into two phones while also ignoring
any predicted _epsilons_. Note that multiphone units are relatively rare but do
occur. In English, letter x gives rise to a few: k-s in taxi, g-z in example, and
sometimes g-zh and k-sh in luxury. Others are w-ah in one, t-s in pizza, y-uw in new
(British), ah-m in -ism etc. Three phone multiphones are much rarer but may exist;
they are not supported by this code as is, but such entries should probably be
ignored. Note the - sign in the multiphone examples is significant and is used to
identify multiphones.
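What it means for an entry to be alignable under a set of allowables can be sketched
as a small search in Python. This is not the code in lts_build.scm; the function
name can_align and the representation of allowables as a dictionary of sets are
assumptions, and the word boundary symbols are left out for brevity.

```python
from functools import lru_cache

def can_align(letters, phones, allowables):
    """True if each letter can take _epsilon_, one phone, or a two-phone multiphone."""
    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == len(letters):
            return j == len(phones)          # all phones must be used up
        allowed = allowables.get(letters[i], ())
        if "_epsilon_" in allowed and solve(i + 1, j):
            return True
        if j < len(phones) and phones[j] in allowed and solve(i + 1, j + 1):
            return True
        if (j + 1 < len(phones)
                and phones[j] + "-" + phones[j + 1] in allowed
                and solve(i + 1, j + 2)):
            return True
        return False
    return solve(0, 0)
```

An entry for which this returns False is one that would be reported as a failure and
would prompt an addition to the allowables.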

The allowables for OALD end up being 

(set! allowables
      '((a _epsilon_ ei aa a e@ @ oo au o i ou ai uh e)
        (b _epsilon_ b )
        (c _epsilon_ k s ch sh @-k s t-s)
        (d _epsilon_ d dh t jh)
        (e _epsilon_ @ ii e e@ i @@ i@ uu y-uu ou ei aa ai y y-u@ o)
        (f _epsilon_ f v )
        (g _epsilon_ g jh zh th f ng k t)
        (h _epsilon_ h @ )
        (i _epsilon_ i@ i @ ii ai @@ y ai-@ aa a)
        (j _epsilon_ h zh jh i y )
        (k _epsilon_ k ch )
        (l _epsilon_ l @-l l-l)
        (m _epsilon_ m @-m n)
        (n _epsilon_ n ng n-y )
        (o _epsilon_ @ ou o oo uu u au oi i @@ e uh w u@ w-uh y-@)
        (p _epsilon_ f p v )
        (q _epsilon_ k )
        (r _epsilon_ r @@ @-r)
        (s _epsilon_ z s sh zh )
        (t _epsilon_ t th sh dh ch d )
        (u _epsilon_ uu @ w @@ u uh y-uu u@ y-u@ y-u i y-uh y-@ e)
        (v _epsilon_ v f )
        (w _epsilon_ w uu v f u)
        (x _epsilon_ k-s g-z sh z k-sh z g-zh )
        (y _epsilon_ i ii i@ ai uh y @ ai-@)
        (z _epsilon_ z t-s s zh )
        (# #)
        ))






Note this is an exhaustive list and (deliberately) says nothing about the contexts or 
frequency that these letter to phone pairs appear. That information will be generated 
automatically from the training set. 

Once the number of failed matches is sufficiently low, let cummulate-pairs run to
completion. This counts the number of times each letter/phone pair occurs in
allowable alignments.

Next call

festival> (save-table "oald-")

with the name of your lexicon. This changes the cumulation table into probabilities
and saves it.

Restart festival loading this new table

festival allowables.scm oald-pl-table.scm

Now each word can be aligned to an equally-lengthed string of phones, epsilons and
multiphones.

festival> (aligndata "oald.train" "oald.train.align")

Do this also for your test set.

This will produce entries like

aaronson _epsilon_ aa r ah n s ah n
abandon ah b ae n d ah n
abate ah b ey t _epsilon_
abbe ae b _epsilon_ iy

The next stage is to build features suitable for wagon to build models. This is done
by

festival> (build-feat-file "oald.train.align" "oald.train.feats")



Again the same for the test set. 
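The kind of feature vector involved can be sketched in Python. The layout below
(predicted phone, four letters of left context, the letter itself, four letters of
right context) is an assumption chosen so that the letter falls in the sixth field;
the padding symbol "0" and the name build_feat_rows are also assumptions, and the
real build-feat-file output may differ.

```python
def build_feat_rows(letters, phones, width=4):
    """One row per aligned letter: phone, left context, letter, right context."""
    assert len(letters) == len(phones), "alignment must be one phone per letter"
    pad = ["0"] * (width - 1)
    tape = pad + ["#"] + list(letters) + ["#"] + pad
    rows = []
    for k, phone in enumerate(phones):
        centre = k + width                    # position of letter k on the tape
        window = tape[centre - width:centre + width + 1]
        rows.append([phone] + window)
    return rows
```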



Now you need to construct a description file for wagon for the given data. This can
be done using the script make_wgn_desc provided with the speech tools.

Here is an example script for building the models. You will need to modify it for
your particular database but it shows the basic processes.

for i in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
   # Stop value for wagon
   STOP=2
   echo letter $i STOP $STOP
   # Find training set for letter $i
   cat oald.train.feats |
    awk '{if ($6 == "'$i'") print $0}' >ltsdataTRAIN.$i.feats
   # split training set to get heldout data for stepwise testing
   traintest ltsdataTRAIN.$i.feats
   # Extract test data for letter $i
   cat oald.test.feats |
    awk '{if ($6 == "'$i'") print $0}' >ltsdataTEST.$i.feats
   # run wagon to predict model
   wagon -data ltsdataTRAIN.$i.feats.train -test ltsdataTRAIN.$i.feats.test \
          -stepwise -desc ltsOALD.desc -stop $STOP -output lts.$i.tree
   # Test the resulting tree against the test set
   wagon_test -heap 2000000 -data ltsdataTEST.$i.feats -desc ltsOALD.desc \
              -tree lts.$i.tree
done

The script traintest splits the given file x into x.train and x.test with every
tenth line in x.test and the rest in x.train.

The model-building script above can take a significant amount of time to run, about
6 hours on a Sun Ultra 140.

Once the models are created they must be collected together into a single list
structure. The trees generated by wagon contain full probability distributions at
each leaf; at this time this information can be removed as only the most probable
will actually be predicted. This substantially reduces the size of the trees.

(merge_models 'oald_lts_rules "oald_lts_rules.scm" allowables)

(merge_models is defined within lts_build.scm.) The given file will contain a set!
for the given variable name to an assoc list of letter to trained tree. Note the
above function naively assumes that the letters in the alphabet are the 26 lower case
letters of the English alphabet; you will need to edit this, adding accented letters
if required. Note that adding "'" (single quote) as a letter is a little tricky in
Scheme but can be done: the command (intern "'") will give you the symbol for single
quote.



To test a set of lts models load the saved model and call the following function with
the test align file

festival oald-table.scm oald_lts_rules.scm
festival> (lts_testset "oald.test.align" oald_lts_rules)

The result (after showing all the failed ones) will be a table showing the results
for each letter, for all letters and for complete words. The failed entries may give
some notion of how good or bad the result is: sometimes it will be simple vowel
differences, long versus short, schwa versus full vowel; other times it may be whole
consonants missing. Remember the ultimate quality of the letter to sound rules is how
adequate they are at providing acceptable pronunciations rather than how good the
numeric score is.
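The difference between the per-letter and whole-word figures can be seen with a
small Python sketch. It is not the lts_testset implementation; the function name
score and the input format (one list of aligned phones per word) are assumptions.

```python
def score(predicted, reference):
    """Return (letter accuracy, word accuracy) for aligned phone strings."""
    letters_total = letters_ok = words_ok = 0
    for pred, ref in zip(predicted, reference):
        matches = sum(p == r for p, r in zip(pred, ref))
        letters_ok += matches
        letters_total += len(ref)
        # A word only counts as correct if every aligned phone matches.
        words_ok += int(len(pred) == len(ref) and matches == len(ref))
    return letters_ok / letters_total, words_ok / len(reference)
```

A single wrong phone costs one letter but a whole word, so word accuracy is always
the harsher of the two figures.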

For some languages (e.g. English) it is necessary to also find a stress pattern for
unknown words. Ultimately for this to work well you need to know the morphological
decomposition of the word. At present we provide a CART trained system to predict
stress patterns for English. It does get 94.6% correct for an unseen test set but
that isn't really very good. Later tests suggest that predicting stressed and
unstressed phones directly is actually better for getting whole words correct even
though the models do slightly worse on a per phone basis [black98b].

As the lexicon may be a large part of the system we have also experimented with
removing entries from the lexicon if the letter to sound rules system (and stress
assignment system) can correctly predict them. For OALD this allows us to halve the
size of the lexicon; it could possibly allow more if a certain amount of fuzzy
acceptance was allowed (e.g. with schwa). For other languages the gain here can be
very significant: for German and French we can reduce the lexicon by over 90%. The
function reduce_lexicon in festival/lib/lts_build.scm was used to do this. The use
of the above technique as a dictionary compression method is discussed in [pagel98].
A morphological decomposition algorithm, like that described in [black91], may even
help more.
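The reduction idea itself is simple and can be sketched in Python. This is not the
reduce_lexicon function from lts_build.scm; the name prune_lexicon, the dictionary
format and the predict callable standing in for the trained rules are assumptions.

```python
def prune_lexicon(lexicon, predict):
    """Keep only the entries the letter to sound model gets wrong."""
    return {word: pron for word, pron in lexicon.items()
            if predict(word) != pron}
```

Allowing fuzzy acceptance would amount to replacing the inequality with a looser
comparison, for example one that treats schwa and a full vowel as equal.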

The technique described in this section and its relative merits with respect to a
number of languages/lexicons and tasks is discussed more fully in [black98b].






Post-lexical rules 

In fluent speech word boundaries are often degraded in a way that causes
co-articulation across boundaries. A lexical entry should normally provide
pronunciations as if the word is being spoken in isolation. It is only once the word
has been inserted into the context in which it is going to be spoken that
co-articulatory effects can be applied.

Post lexical rules are a general set of rules which can modify the segment relation
(or any other part of the utterance for that matter), after the basic pronunciations
have been found. In Festival post-lexical rules are defined as functions which will
be applied to the utterance after intonational accents have been assigned.

For example in British English word final /r/ is only produced when the following
word starts with a vowel. Thus all other word final /r/s need to be deleted. A Scheme
function that implements this is as follows

(define (plr_rp_final_r utt)
  (mapcar
   (lambda (s)
     (if (and (string-equal "r" (item.name s))  ;; this is an r
              ;; it is syllable final
              (string-equal "1" (item.feat s "syl_final"))
              ;; the syllable is word final
              (not (string-equal "0"
                    (item.feat s "R:SylStructure.parent.syl_break")))
              ;; The next segment is not a vowel
              (string-equal "-" (item.feat s "n.ph_vc")))
         (item.delete s)))
   (utt.relation.items utt 'Segment)))
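The same rule can be sketched on plain lists in Python, which may make the logic
easier to follow. This is a word-level simplification following the prose
description rather than a translation of the Scheme above; the name delete_final_r
and the representation of an utterance as a list of phone lists are assumptions.

```python
def delete_final_r(words, vowels):
    """Drop a word final r unless the next word starts with a vowel."""
    out = []
    for i, phones in enumerate(words):
        has_next = i + 1 < len(words) and words[i + 1]
        next_phone = words[i + 1][0] if has_next else None
        if phones and phones[-1] == "r" and next_phone not in vowels:
            phones = phones[:-1]
        out.append(list(phones))
    return out
```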

In English we also use post-lexical rules for phenomena such as vowel reduction
and schwa deletion in the possessive "'s".

Building lexicons for new languages 

Traditionally building a new lexicon for a language was a significant piece of work,
taking several expert phonologists perhaps several years to construct a lexicon with
reasonable coverage. However we include a method here that can cut this time
significantly using the basic technology provided with this documentation.

The basic idea is to add the most common words to a lexicon, explicitly giving their
pronunciation by hand, then automatically build letter to sound rules from the
initial data. Then, finding the most common words, submit them to the system and
check their correctness. If wrong they are corrected and added to the lexicon; if
correct they are added to the lexicon as is. Over multiple passes the lexicon and
letter to sound rules will improve. At each pass the letter to sound rules are
re-generated with the new data, making them more correct.

This technique has proved successful for a number of languages, cutting the amount
of time and effort to perhaps checking thousands of words rather than tens of
thousands of words. It also is a structured method that requires only knowledge of
the basic language to carry out. Good lexicons can be generated in as little as a
couple of weeks, though to get greater than 95% correctness of words in a language
could still take several months' work.

As stated above you can never list all the words in a language, but having greater
than 95% coverage with letter to sound rule accuracy greater than 75% you will have a
lexicon that is competitive with those that take many years to build. In fact because
you can build a lexicon in a shorter time it is more likely to be consistent and
therefore better for synthesis.






Chapter 8. Building prosodic models 



Phrasing 



Prosodic phrasing in speech synthesis makes the whole speech more understandable.
Due to the size of people's lungs there is a finite length of time people can talk
before they must take a breath, which defines an upper bound on prosodic phrases.
However we rarely make our phrases this maximum length and use phrasing to mark
groups within the speech. There is the apocryphal story of the speech synthesis
example with an unnaturally long prosodic phrase played at a conference presentation.
At the end of the phrase the audience all took a large intake of breath.

For the most part very simple prosodic phrasing is sufficient. A comparison of
various prosodic phrasing techniques is discussed in [taylor98a], though we will cover
some of them here also.

For English (and most likely many other languages too) simple rules based on
punctuation are a very good predictor of prosodic phrase boundaries. It is rare that
punctuation exists where there is no boundary, but there will be a substantial number
of prosodic boundaries which are not explicitly marked with punctuation. Thus a
prosodic phrasing algorithm solely based on punctuation will typically underpredict
but rarely make a false insertion. However, depending on the actual application
you wish to use the synthesizer for, it may be the case that explicitly adding
punctuation at desired phrase breaks is possible and a prediction system based solely on
punctuation is adequate.

Festival basically supports two methods for predicting prosodic phrases, though any
other method can easily be used. Note that these do not necessarily entail pauses in the
synthesized output. Pauses are further predicted from prosodic phrase information.

The first basic method is by CART tree. A test is made on each word to predict whether
it is at the end of a prosodic phrase. The basic CART tree returns B or BB (though it may
return whatever you consider appropriate as break labels, as long as the rest of your
models support it). The two levels identify different levels of break, BB being used
to denote a bigger break (and end of utterance).

The following tree is very simple and simply adds a break after the last word of
a token that has following punctuation. Note the first condition is done by a Lisp
function as we want to ensure that only the last word in a token gets the break.
(Earlier erroneous versions of this would insert breaks after each word in "1984.")

(set! simple_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
 ((BB))
 ((lisp_token_end_punc in ("'" "\"" "," ";"))
  ((B))
  ((n.name is 0)  ;; end of utterance
   ((BB))
   ((NB))))))



This tree is defined in festival/lib/phrase.scm in the standard distribution and is
certainly a good first step in defining a phrasing model for a new language.
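To make the decisions of the punctuation tree concrete, here is a small Python
sketch that mimics it outside Festival. The function name and the (word, punctuation)
token format are assumptions made for the illustration; Festival itself walks the
CART tree over items in the Word relation.

```python
def simple_phrase_breaks(tokens):
    """tokens: list of (word, trailing_punctuation) pairs, "" for none."""
    breaks = []
    for index, (_, punc) in enumerate(tokens):
        if punc in ("?", ".", ":"):
            breaks.append("BB")
        elif punc in ("'", '"', ",", ";"):
            breaks.append("B")
        elif index == len(tokens) - 1:  # end of utterance
            breaks.append("BB")
        else:
            breaks.append("NB")
    return breaks
```

For the tokens ("Well", ","), ("it", ""), ("works", ".") this gives B, NB, BB: a
small break at the comma, none inside the phrase and a big break at the full stop.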

To make a better phrasing model requires more information. As the basic punctuation
model underpredicts, we need information that will find reasonable boundaries
within strings of words. In English, boundaries are more likely between content
words and function words, because most function words come before the words they
relate to. In Japanese, function words typically come after their related content words,
so breaks are more likely between function words and content words. If you have
no data to train from, written rules, in a CART tree, can exploit this fact and give
a phrasing model that is better than a punctuation-only one. Basically a rule could be: if
the current word is a content word and the next is a function word (or the reverse if
that is appropriate for the language) and we are more than 5 words from a punctuation
symbol, then predict a break. We may also want to ensure that we are at least
five words from a predicted break too.

Note the above basic rules aren't optimal, but when you are building a new voice in a
new language and have no data to train from you will get reasonably far with simple
rules like that, such that phrasing prediction will be less of a problem than the other
problems you will find in your voice.

To implement such a scheme we need three basic functions: one to determine if the
current word is a function or content word, one to determine the number of words since
the previous punctuation (or start of utterance) and one to determine the number of
words to the next punctuation (or end of utterance). The first of these functions is
already provided through the feature function gpos. This uses the word list in the
Lisp variable guess_pos to determine the basic category of a word. Because in most
languages the set of function words is very nearly a closed class, they can usually
be explicitly listed. The format of the guess_pos variable is a list of lists whose first
element is the set name and the rest of the list is the words that are part of that set.
Any word not a member of any of these sets is defined to be in the set content. For
example the basic definition of this for English, given in festival/lib/pos.scm, is

(set! english_guess_pos
 '((in of for in on that with by at from as if that against about
       before because if under after over into while without
       through new between among until per up down)
   (to to)
   (det the a an no some this that each another those every all any
        these both neither no many)
   (md will may would can could should must ought might)
   (cc and but or plus yet nor)
   (wp who what where how when)
   (pps her his their its our their its mine)
   (aux is am are was were has have had be)
   (punc "." "," ":" ";" "\"" "'" "(" "?" ")" "!")
   ))

The punctuation distance check can be written as a Lisp feature function

(define (since_punctuation word)
 "(since_punctuation word)
Number of words since last punctuation or beginning of utterance."
 (cond
  ((null word) 0) ;; beginning of utterance
  ;; the punctuation feature is "0" when there is none, so stop counting
  ;; as soon as the previous word carries punctuation
  ((not (string-equal "0" (item.feat word "p.lisp_token_end_punc"))) 0)
  (t
   (+ 1 (since_punctuation (item.prev word))))))

The function looking forward would be

(define (until_punctuation word)
 "(until_punctuation word)
Number of words until next punctuation or end of utterance."
 (cond
  ((null word) 0) ;; end of utterance
  ;; stop counting at a word that itself carries punctuation
  ((not (string-equal "0" (token_end_punc word))) 0)
  (t
   ;; look forward, so recurse on the next word
   (+ 1 (until_punctuation (item.next word))))))



The whole tree using these features that will insert a break at punctuation or be- 
tween content and function words more than 5 words from a punctuation symbol is 
as follows 






(set! simple_phrase_cart_tree_2
'
((lisp_token_end_punc in ("?" "." ":"))
 ((BB))
 ((lisp_token_end_punc in ("'" "\"" "," ";"))
  ((B))
  ((n.name is 0)  ;; end of utterance
   ((BB))
   ((lisp_since_punctuation > 5)
    ((lisp_until_punctuation > 5)
     ((gpos is content)
      ((n.gpos is content)
       ((NB))
       ((B)))  ;; not content so a function word
      ((NB)))  ;; this is a function word
     ((NB)))   ;; too close to punctuation
    ((NB)))))))  ;; too soon after punctuation
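The combined rule can also be tried out away from Festival. The Python sketch below
is an illustration under stated assumptions: gpos here only distinguishes function
from content words using a short hypothetical word list, tokens are plain
(word, punctuation) pairs, and the two distance counts are computed in the same
spirit as since_punctuation and until_punctuation rather than by the Festival
feature functions themselves.

```python
FUNCTION_WORDS = {"of", "for", "in", "on", "with", "by", "at", "from", "over",
                  "the", "a", "an", "and", "but", "or", "to"}


def gpos(word):
    """Crude stand-in for Festival's gpos: function word or content word."""
    return "function" if word.lower() in FUNCTION_WORDS else "content"


def phrase_breaks(tokens, min_distance=5):
    """tokens: list of (word, trailing_punctuation) pairs, "" for none."""
    last = len(tokens) - 1
    since = [0] * len(tokens)  # words since punctuation or utterance start
    until = [0] * len(tokens)  # words until punctuation or utterance end
    for i in range(1, len(tokens)):
        since[i] = 0 if tokens[i - 1][1] else since[i - 1] + 1
    for i in range(last - 1, -1, -1):
        until[i] = 0 if tokens[i][1] else until[i + 1] + 1
    breaks = []
    for i, (word, punc) in enumerate(tokens):
        if punc in ("?", ".", ":"):
            breaks.append("BB")
        elif punc in ("'", '"', ",", ";"):
            breaks.append("B")
        elif i == last:  # end of utterance
            breaks.append("BB")
        elif (since[i] > min_distance and until[i] > min_distance
              and gpos(word) == "content"
              and gpos(tokens[i + 1][0]) == "function"):
            breaks.append("B")  # content word followed by a function word
        else:
            breaks.append("NB")
    return breaks


WORDS = ("yesterday several small brown birds flew south over "
         "distant snowy mountains toward warmer lands").split()
TOKENS = [(w, "") for w in WORDS[:-1]] + [(WORDS[-1], ".")]
```

On this fourteen word sentence the only internal break falls after "south", the one
content word that is followed by a function word and is more than five words from
both ends.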

To use this add the above to a file in your festvox/ directory and ensure it is loaded
by your standard voice file. In your voice definition function add the following

(set! guess_pos english_guess_pos) ;; or appropriate for your language

(Parameter.set 'Phrase_Method 'cart_tree)
(set! phrase_cart_tree simple_phrase_cart_tree_2)

A much better method for predicting phrase breaks is using a full statistical model
trained from data. The problem is that you need a lot of data to train phrase break
models. Elsewhere in this document we suggest the use of a TIMIT-style database
of around 460 sentences (around 14500 segments) for training models. However a
database such as this has very few utterance-internal phrase breaks. An almost perfect
model would predict breaks at the end of each utterance and never internally. Even
the f2b database from the Boston University Radio News Corpus [ostendorf95], which
does have a number of utterance-internal breaks, isn't really big enough. For
English we used the MARSEC database [roach93], which is much larger (around 37,000
words). Finding such a database for your language will not be easy and you may
need to fall back on a purely hand-written rule system.

Often syntax is suggested as a strong correlate of prosodic phrasing. Although there
is evidence that it influences prosodic phrasing, there are notable exceptions
[bachenko90]. Also, considering how difficult it is to get a reliable parse tree, it is
probably not worth the effort; training a reliable parser is non-trivial (though we
provide a method for training stochastic context free grammars in the speech tools,
see the manual for details). Of course if your text to be synthesized is coming from a
language system such as machine translation or language generation then a syntax
tree may be readily available. In that case a simple rule mechanism taking into
account syntactic phrasing may be useful.

When only moderate amounts of data are available for training, a simple CART tree
may be able to tease out a reasonable model. See [hirschberg94] for some discussion
on this. Here is a short example of building a CART tree for phrase prediction. Let us
assume you have a database of utterances as described previously. By convention we
build models in directories under festival/ in the main database directory. Thus let
us create festival/phrbrk.

First we need to list the features that are likely to be suitable predictors for phrase
breaks. Add these to a file phrbrk.feats. What goes in here will depend on what
you have; full part of speech helps a lot but you may not have that for your language.
The gpos described above is a good cheap alternative. Possible features may be






word_break
lisp_token_end_punc
lisp_until_punctuation
lisp_since_punctuation
p.gpos
gpos
n.gpos

Given this list you can extract features from your database of utterances with the
Festival script dumpfeats

dumpfeats -eval ../../festvox/phrbrk.scm -feats phrbrk.feats \
   -relation Word -output phrbrk.data ../utts/*.utts

festvox/phrbrk.scm should contain the definitions of the functions
until_punctuation, since_punctuation and any other Lisp feature functions you
define.

Next we want to split this data into test and train data. We provide a simple shell
script called traintest which splits a given file 9:1, i.e. every 10th line is put in the
test set.

traintest phrbrk.data 
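If the traintest script is not to hand, the same kind of 9:1 split is easy to
reproduce. The Python sketch below is a hypothetical stand-in, not the script
itself: it assumes "every 10th line" means lines 10, 20, 30 and so on, which may
differ in detail from what traintest selects.

```python
def traintest_split(lines, every=10):
    """Put every tenth line in the test set and the rest in the train set."""
    train, test = [], []
    for number, line in enumerate(lines, start=1):
        (test if number % every == 0 else train).append(line)
    return train, test
```

Taking every tenth line, rather than the last tenth of the file, keeps the test set
spread across the whole database.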

As we intend to run the wagon CART tree builder on this data we also need to create the
feature description file for the data. The feature description file consists of a bracketed
list of feature name and type. Type may be int, float or categorical, where a list of
possible values is given. The script make_wagon_desc (distributed with the speech
tools) will make a reasonable approximation for this file

make_wagon_desc phrbrk.data phrbrk.feats phrbrk.desc

This script will treat all features as categorical. Thus any float or int features will
be treated categorically and each value found in the data will be listed as a separate
item. In our example lisp_since_punctuation and lisp_until_punctuation
are actually float (well maybe even int) but they will be listed as categorical in
phrbrk.desc, something like



(lisp_since_punctuation
0
1
2
4
3
5
6
7
8)

You should change this entry (by hand) to be

(lisp_since_punctuation float )



The script cannot work out the type of a feature automatically so you must make 
this decision yourself. 
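When there are many features, a first guess at the types can be automated and then
checked by hand. The Python sketch below is a hypothetical helper, not part of the
speech tools: it proposes float when every observed value parses as a number and
otherwise lists the distinct values. It will guess wrongly for categorical features
whose labels happen to look like numbers, so the decision still needs reviewing.

```python
def describe_feature(name, values):
    """Return a wagon-style description entry for one feature column."""
    try:
        for value in values:
            float(value)
    except ValueError:
        # At least one non-numeric value: list each distinct value once.
        return "(" + " ".join([name] + sorted(set(values))) + ")"
    return "(" + name + " float)"
```

For example, the values 0, 1, 4 and 3 for lisp_since_punctuation yield
(lisp_since_punctuation float), while the values det, content, det for gpos yield
(gpos content det).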

Now that we have the data and description we can build a CART tree. The basic 
command for wagon will be 






wagon -desc phrbrk.desc -data phrbrk.data.train -test phrbrk.data.test \
   -output phrbrk.tree

You will probably also want to set a stop value. The default stop value is 50, which 
means there must be at least 50 examples in a group before it will consider looking 
for a question to split it. Unless you have a lot of data this is probably too large and a 
value of 10 to 20 is probably more reasonable. 

Other arguments to wagon should also be considered. A stepwise approach, where
all features are tested incrementally to find the best set of features which give the
best tree, can give better results than simply using all features. Care should
be taken with this, though, as the generated tree becomes optimized for the given test
set. Thus a further held out test set is required to properly test the accuracy of the
result. In the stepwise case it is normal to split the train set again and call wagon as
follows

traintest phrbrk.data.train
wagon -desc phrbrk.desc -data phrbrk.data.train.train \
   -test phrbrk.data.train.test \
   -output phrbrk.tree -stepwise
wagon_test -data phrbrk.data.test -desc phrbrk.desc \
   -tree phrbrk.tree

Stepwise is particularly useful when features are highly correlated with each other
and it's not clear which is the best general predictor. Note that stepwise will take much
longer to run as it potentially must build a large number of trees.
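The idea behind a stepwise search can be shown with a short greedy loop. The Python
sketch below is an assumption about the general technique (forward selection), not
a description of wagon's internals, and toy_score with its EXPLAINS table is a
hypothetical stand-in for "build a tree with these features and score it on the
held out data".

```python
def stepwise_select(features, score):
    """Greedy forward selection: keep adding the single best feature."""
    chosen, best = [], score([])
    remaining = list(features)
    while remaining:
        trial_score, trial_feature = max(
            (score(chosen + [feature]), feature) for feature in remaining)
        if trial_score <= best:
            break  # no remaining feature improves the held out score
        chosen.append(trial_feature)
        remaining.remove(trial_feature)
        best = trial_score
    return chosen, best


# Hypothetical scores: each feature explains some examples, and
# correlated features overlap in what they explain.
EXPLAINS = {
    "lisp_token_end_punc": {1, 2, 3, 6},
    "gpos": {3, 4, 5},
    "n.gpos": {4},
    "p.gpos": {3},
}


def toy_score(subset):
    covered = set()
    for feature in subset:
        covered |= EXPLAINS[feature]
    return len(covered)
```

Here the search keeps lisp_token_end_punc and gpos and then stops, because n.gpos
and p.gpos only repeat information that gpos already supplies. Each round scores
every remaining feature, which is why the real thing is slow: every score is a
whole tree build.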

Other arguments to wagon can be considered; refer to the relevant chapter in the speech
tools manual for their details.

However it should be noted that without a good intonation and duration model
spending time on producing good phrasing is probably not worth it. The quality of
all three of these prosodic components is closely related, such that if one is much
better than the others there may not be any real benefit.

Accent/Boundary Assignment 

Accents and boundary tones are the terms we will use, hopefully in a theory-independent
way, to refer to the two main types of intonation event. For English, and for many
other languages, the prediction of the position of accents and boundaries can be done
as an independent process from F0 contour generation itself. This is definitely true for
the major theories we will be considering.



As with phrase break prediction there are some simple rules that will go a
surprisingly long way. And as with most of the other statistical learning techniques,
simple rules cover most of the work, more complex rules work better, but the best
results are from using the sorts of information you were using in rules but statistically
training them from appropriate data.

For English the placement of accents on stressed syllables in content words is a
quite reasonable approximation, achieving about 80% accuracy on typical databases.
[hirschberg90] is probably the best example of a detailed rule driven approach (for
English). CART trees based on the sorts of features Hirschberg uses are quite
reasonable. Eventually, though, these rules become limiting and a richer knowledge
source is required to assign accent patterns to complex nominals (see [sproat90]).



However all these techniques quickly come to the stumbling block that although
simple so-called discourse neutral intonation is relatively easy to achieve, achieving
realistic, natural accent placement is still beyond our synthesis systems (though perhaps
not for much longer).

The simplest rule for English may be reasonable for other languages. There are even 
simpler solutions to this, such as fixed prosody, or fixed declination, but apart from 
debugging a voice these are simpler than is required even for the most basic voices. 






For English, adding a simple hat accent on lexically stressed syllables in all content
words works surprisingly well. To do this in Festival you need a CART tree to predict
accentedness, and rules to add the hat accent (though we will leave out F0 generation
until the next section).

A basic tree that predicts accents of stressed syllables in content words is 



(set! simple_accent_cart_tree
'
((R:SylStructure.parent.gpos is content)
 ((stress is 1)
  ((Accented))
  ((NONE)))
 ((NONE))))

The above tree simply distinguishes accented syllables from non-accented. In
theories like ToBI [silverman92], a number of different types of accent are supported.
ToBI, with variations, has been applied to a number of languages and may be
suitable for yours. However, even though accent and boundary types have been identified
for various languages and dialects, a computational mechanism for generating an
F0 contour from an accent specification often has not yet been specified (we will
discuss this more fully below).

If the above is considered too naive, a more elaborate hand specified tree can also
be written, using relevant factors probably similar to those used in [hirschberg90].
Following that, training from data is the next option. Assuming a database exists
and has been labeled with discrete accent classifications, we can extract data from it
for training a CART tree with wagon. We will build the tree in festival/accents/.
First we need a file listing the features that are felt to affect accenting. For this we will
predict accents on syllables as that has been used for the English voices created so far,
but there is an argument for predicting accent placement on a word basis: although
accents will need to be syllable aligned, which syllable in a word gets the accent is
reasonably well defined (at least compared with predicting accent placement).

A possible list of features for accent prediction is put in the file accent.feats.

R:Intonation.daughter1.name
R:SylStructure.parent.R:Word.p.gpos
R:SylStructure.parent.gpos
R:SylStructure.parent.R:Word.n.gpos
ssyl_in
syl_in
ssyl_out
syl_out
p.stress
stress
n.stress
pp.syl_break
p.syl_break
syl_break
n.syl_break
nn.syl_break
pos_in_word
position_type

We can extract these features from the utterances using the Festival script dumpfeats



dumpfeats -feats accent.feats -relation Syllable \ 
-output accent.data ../utts/*.utts 






We now need a description file for the features which can be approximated by the 
speech tools script make_wagon_desc 

make_wagon_desc accent.data accent.feats accent.desc

Because this script cannot determine whether a feature is categorical or takes a range
of values, you must hand edit the output file and change any such feature to float or
int if that is what it is.

The next stage is to split the data into training and test sets. If stepwise training is to
be used for building the CART tree (which is recommended) then the training data
should be further split

traintest accent.data
traintest accent.data.train

Deciding on a stop value for training depends on the number of examples, though
this can be tuned to ensure over-training isn't happening.

wagon -data accent.data.train.train -desc accent.desc \
   -test accent.data.train.test -stop 10 -stepwise -output accent.tree
wagon_test -data accent.data.test -desc accent.desc \
   -tree accent.tree

The above is designed to predict accents, and a similar tree should be used to predict
boundary tones as well. For the most part intonation boundaries are defined to occur
at prosodic phrase boundaries so that task is somewhat easier, though if you have
a number of boundary tone types in your inventory then the prediction is not so
straightforward.

When training ToBI type accent types it is not easy to get the right type of variation
in the accent types. Although some ToBI labels have been associated with semantic
intentions, and including discourse information has been shown to help prediction (e.g.
[black97a]), getting this acceptably correct is not easy. Various techniques for
modifying the training data do seem to help. Because of the low incidence of "L*" labels
in at least the f2b data, duplicating all sample points in the training data with "L*"
labels does increase the likelihood of their prediction and does seem to give a more
varied distribution. Alternatively, wagon returns a probability distribution for the
accents. Normally the most probable is selected, but this could be modified to select
from the distribution randomly based on their probabilities.
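Both tricks are simple to prototype. The Python sketch below is illustrative only:
the function names, the (label, features) row format and the dictionary form of the
distribution are assumptions made for the example, not the format wagon or Festival
use.

```python
import random


def sample_accent(distribution, rng):
    """Pick a label at random in proportion to its predicted probability."""
    labels = list(distribution)
    weights = [distribution[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def duplicate_rare(rows, label, copies=2):
    """Repeat every training row carrying the rare label `copies` times."""
    out = []
    for row in rows:
        out.extend([row] * (copies if row[0] == label else 1))
    return out
```

Sampling means that a leaf predicting "H*" with probability 0.7 and "L*" with 0.3
will produce "L*" on roughly three syllables in ten instead of never. Passing a
seeded random.Random instance as rng keeps the synthesized output repeatable.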

Once trees have been built they can be used in a voice as follows. Within the voice
definition function

(set! int_accent_cart_tree simple_accent_cart_tree)
(set! int_tone_cart_tree simple_tone_cart_tree)
(Parameter.set 'Int_Method Intonation_Tree)

or if only one tree is required you can use the simpler intonation method

(set! int_accent_cart_tree simple_accent_cart_tree)
(Parameter.set 'Int_Method Intonation_Simple)






F0 Generation



Predicting where accents go (and their types) is only half of the problem. We also have
to build an F0 contour based on these. Note intonation is split between accent placement
and F0 generation as it is obvious that accent position influences durations, and an
F0 contour cannot be generated without knowing the durations of the segments the
contour is to be generated over.

There are three basic F0 generation modules available in Festival, though others
could be added: by general rule, by linear regression/CART, and by Tilt.



F0 by rule



The first is designed to be the most general and will always allow some form of F0
generation. This method allows target points to be programmatically created for each
syllable in an utterance. The idea follows closely a generalization of the
implementation of ToBI type accents in [anderson84], where n points are predicted for each
accent. They (and others in intonation) appeal to the notion of a baseline and place
target F0 points above and below that line based on accent type and position in phrase.
The baseline itself is often defined to decline over the phrase, reflecting the general
declination of F0 over time.

The simple idea behind this general method is that a Lisp function is called for each
syllable in the utterance. That Lisp function returns a list of target F0 points that lie
within that syllable. Thus the generality of this method actually lies in the fact that it
simply allows the user to program anything they want. For example our simple hat
accent can be generated using this technique as follows.

This fixes the F0 range of the speaker so would need to be changed for different
speakers.

(define (targ_func1 utt syl)
  "(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end)))
    (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
        (list
         (list start 110)
         (list (/ (+ start end) 2.0) 140)
         (list end 100)))))

It simply checks if the current syllable is accented and if so returns a list of
position/target pairs: a value of 110Hz at the start of the syllable, a value of 140Hz at
the mid-point of the syllable and a value of 100Hz at the end.
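One way to avoid hard-wiring a single speaker's range is to express the hat relative
to a base pitch. The Python sketch below is a hypothetical variant written for
illustration (the parameter names base, rise and fall are not Festival names); with
its defaults it produces the same three targets as the rule above.

```python
def hat_targets(start, end, accented, base=110.0, rise=30.0, fall=10.0):
    """Return (time, F0) targets for one syllable; empty if unaccented."""
    if not accented:
        return []
    middle = (start + end) / 2.0
    return [(start, base), (middle, base + rise), (end, base - fall)]
```

For an accented syllable running from 1.0 to 2.0 seconds the defaults give targets of
110Hz, 140Hz and 100Hz at 1.0, 1.5 and 2.0 seconds. Calling it with base=200.0 moves
the whole hat up for a higher pitched speaker.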

This general technique can be expanded with other rules as necessary. Festival
includes an implementation of ToBI using exactly this technique; it is based on the
rules described in [jilka96] and is in the file festival/lib/tobi_f0.scm.

F0 by linear regression

This technique was developed specifically to avoid the difficult decisions of exactly
what parameters with what value should be used in rules like those of [anderson84].
The first implementation of this work is presented in [black96]. The idea is to find the
appropriate F0 target value for each syllable based on available features by training
from data. A set of features is collected for each syllable and a linear regression
model is used to model three points on each syllable. The technique produces
reasonable synthesis and requires less analysis of the intonation models than would be
required to write a rule system using the general F0 target method described in the
previous section.






However to be fair, this technique is also much simpler and there are obviously a
number of intonational phenomena which this cannot capture (e.g. multiple accents
on syllables, and it will never really capture accent placement with respect to the
vowel). The previous technique allows specification of structure but without explicit
training from data (though doesn't exclude that) while this technique imposes almost
no structure but depends solely on data. The Tilt modeling discussed in the following
section tries to balance these two extremes.

The advantage of the linear regression method is that very little knowledge about the
intonation of the language under study needs to be known. Of course if there is
knowledge and theories it is usually better to follow them (or at least find the features
which influence the F0 in that language). Extracting features for F0 modeling is
similar to extracting features for the other models. This time we want the mean F0 at
the start, mid and end of each syllable. The Festival features syl_startpitch,
syl_midpitch and syl_endpitch provide this. Note that syl_midpitch returns the
pitch at the mid of the vowel in the syllable rather than the middle of the syllable.

For a linear regression model all features must be continuous. Thus features which
are categorical that influence F0 need to be converted. The standard technique for
this is to introduce new features, one for each possible value in the class, and output
values of 0 or 1 for these modified features depending on the value of the base
features. For example in a ToBI environment the output of the feature tobi_accent will
include H*, L*, L+H* etc. In the modified form you would have features of the form
tobi_accent_H*, tobi_accent_L*, tobi_accent_L_H*, etc.
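This conversion can be written as a small helper. The Python sketch below is a
hypothetical illustration of the encoding, not part of the speech tools; it follows
the naming shown above by replacing "+" with "_" in the class name.

```python
def one_hot(feature, value, classes):
    """Expand one categorical value into 0/1 float features, one per class."""
    encoded = {}
    for cls in classes:
        name = feature + "_" + cls.replace("+", "_")
        encoded[name] = 1.0 if value == cls else 0.0
    return encoded
```

A syllable whose tobi_accent is L+H* becomes tobi_accent_H* = 0, tobi_accent_L* = 0
and tobi_accent_L_H* = 1. A value outside the class list, such as NONE, gives all
zeros.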

The program ols in the speech tools takes feature files and description files in exactly
the same format as wagon, except that all features must be declared as type float.
The standard ordinary least squares algorithm used to find the coefficients cannot,
in general, deal with features that are directly correlated with others as this causes a
singularity when inverting the matrix. The solution to this is to exclude such features.
The option -robust enables that, though at the expense of a longer compute time.
Again like wagon a stepwise option is included so that the best subset of features may
be found.

The resulting models may be used by the Int_Targets_LR module which takes its
LR models from the variables f0_lr_start, f0_lr_mid and f0_lr_end. The output
of ols is a list of coefficients (with the Intercept first). These need to be converted to
the appropriate bracket form including their feature names. An example of this is
in festival/lib/f2bf0lr.scm.

If the conversion of categoricals to floats seems too much work or would prohibitively
increase the number of features you could use wagon to generate trees to predict
F0 values. The advantage of a decision tree over the LR model is that it can
deal with data in a non-linear fashion, but this is also the disadvantage. Also the
decision tree technique may split the data sub-optimally. The LR model is probably
more theoretically appropriate but ultimately the results depend on how good the
models sound.

Dump features as with the LR models, but this time there is no need to convert
categorical features to floats. A potential set of features to do this from (substitute
syl_midpitch and syl_endpitch for the other two models) is




syl_startpitch
pp.tobi_accent
p.tobi_accent
tobi_accent
n.tobi_accent
nn.tobi_accent
pp.tobi_endtone
R:Syllable.p.tobi_endtone
tobi_endtone
n.tobi_endtone
nn.tobi_endtone
pp.syl_break
p.syl_break
syl_break
n.syl_break
nn.syl_break
pp.stress
p.stress
stress
n.stress
nn.stress
syl_in
syl_out
ssyl_in
ssyl_out
asyl_in
asyl_out
last_accent
next_accent
sub_phrases

The above, of course, assumes a ToBI accent labeling; modify that as appropriate for
your actual labeling.

Once you have generated three trees predicting values for start, mid and end points
in each syllable you will need to add some Scheme code to use these appropriately.
Suitable code is provided in src/intonation/tree_f0.scm; you will need to include
that in your voice. To use it as the intonation target module you will need to add
something like the following to your voice function

(set! F0start_tree f2b_F0start_tree)
(set! F0mid_tree f2b_F0mid_tree)
(set! F0end_tree f2b_F0end_tree)
(set! int_params
      '((target_f0_mean 110) (target_f0_std 10)
        (model_f0_mean 170) (model_f0_std 40)))
(Parameter.set 'Int_Target_Method Int_Targets_Tree)

The int_params values allow you to use the model with a speaker of a different
pitch range. That is, all predicted values are converted using the formula

(+ (* (/ (- value model_f0_mean) model_f0_stddev)
      target_f0_stddev) target_f0_mean)

Or for those of you who can't read Lisp expressions

(((value - model_f0_mean) / model_f0_stddev) * target_f0_stddev) +
   target_f0_mean

The values in the example above are for converting a female speaker (used for
training) to a male pitch range.
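As a quick check of the formula, here it is as a Python function with the example
values plugged in. The function name map_f0 is made up for this illustration; only
the arithmetic comes from the text above.

```python
def map_f0(value, model_mean, model_std, target_mean, target_std):
    """Rescale a predicted F0 value from the model's range to the target's."""
    return ((value - model_mean) / model_std) * target_std + target_mean
```

With a model mean of 170 and standard deviation of 40, and a target mean of 110 and
standard deviation of 10, a predicted 170Hz maps to 110Hz, 210Hz (one standard
deviation above the model mean) maps to 120Hz and 130Hz maps to 100Hz. The shape of
the contour is kept while its centre and excursion size are moved to the target
speaker.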

Tilt modeling

Tilt modeling is still under development and not as mature as the other methods
described above, but it potentially offers a more consistent solution to the problem.
A Tilt parameterization of a natural F0 contour can be automatically derived from a
waveform and a labeling of accent placements (a simple "a" for accents and "b" for
boundaries) [taylor99]. Further work is being done on trying to automatically find
the accent placements too.

For each "a" in a labeling four continuous parameters are found: height, duration,
peak position with respect to vowel start, and tilt. Prediction models may then be
generated to predict these parameters, which we feel better capture the dimensions
of the F0 contour itself. We have had success in building models for these parameters
[dusterhoff97a], with better results than the linear regression model on comparable
data. However so far we have not done any tests with Tilt on languages other than
English.

The speech tools include the programs tilt_analyse and tilt_synthesize to aid
model building but we do not yet include full Festival end support for using the
generated models.



Duration

Like the above prosody phenomena, very simple solutions to predicting durations
work surprisingly well, though very good solutions are extremely difficult to achieve.

Again the basic strategy is assigning fixed models, simple rule models, complex
rule models, and trained models using the features in the complex rule models. The
choice of where to start depends on the resources available to you and the time you wish
to spend on the problem. Given a reasonably sized database, training a simple CART
tree for durations achieves quite acceptable results. This is currently what we do for
our English voices in Festival. There are better models out there but we have not fully
investigated them or included easy scripts to customize them.



The simplest model for duration is a fixed duration for each phone. A value of 100
milliseconds is a reasonable start. This type of model is only of use at initial
testing of a diphone database; beyond that it sounds too artificial. The Festival function
SayPhones uses a fixed duration model, controlled by the value (in ms) in the
variable FP_duration. Although there is a fixed duration module in Festival (see the
manual) it's worthwhile starting off with something a little more interesting.

The next level for duration models is to use average durations for the phones. Even
when real data isn't available to calculate averages, writing values by hand can be
acceptable: basically vowels are longer than consonants, and stops are the shortest.
Estimating values for a set of phones can be done by looking at data from another
language (if you are really stuck, see festival/lib/mrpa_durs.scm), to get the
basic idea of average phone lengths.

In most languages phones are longer at the phrase final and to a lesser extent phrase
initial positions. A simple multiplicative factor can be defined for these positions.
The next stage from this is a set of rules that modify the basic average based on the
context they occur in. For English the best definition of such rules is the duration
rules given in chapter 9, [allen87] (often referred to as the Klatt duration model). The
factors used in this may also apply to other languages. A simplified form of this, which
we have successfully used for a number of languages and often use as our first
approximation for a duration rule set, is as follows.

Here we define a simple decision tree that returns a multiplication factor for a
segment



(set! simple_dur_tree
 '
   ((R:SylStructure.parent.R:Syllable.p.syl_break > 1 ) ;; clause initial
    ((R:SylStructure.parent.stress is 1)
     ((1.5))
     ((1.2)))
    ((R:SylStructure.parent.syl_break > 1)   ;; clause final
     ((R:SylStructure.parent.stress is 1)
      ((1.5))
      ((1.2)))
     ((R:SylStructure.parent.stress is 1)
      ((ph_vc is +)
       ((1.2))
       ((1.0)))
      ((1.0))))))
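To make the branching of the tree explicit, the following is a minimal Python sketch of the same decisions. It is not Festival code: the function name and its four arguments are hypothetical stand-ins for the features the tree tests (the previous syllable's syl_break, the current syllable's syl_break, the syllable stress, and whether ph_vc is "+").

```python
def simple_dur_factor(prev_syl_break, syl_break, stress, is_vowel):
    """Return the multiplication factor chosen by the simple duration tree.

    Hypothetical arguments standing in for the Festival features:
      prev_syl_break -> R:SylStructure.parent.R:Syllable.p.syl_break
      syl_break      -> R:SylStructure.parent.syl_break
      stress         -> R:SylStructure.parent.stress
      is_vowel       -> ph_vc is +
    """
    if prev_syl_break > 1:            # clause initial
        return 1.5 if stress == 1 else 1.2
    if syl_break > 1:                 # clause final
        return 1.5 if stress == 1 else 1.2
    if stress == 1:                   # stressed syllable elsewhere
        return 1.2 if is_vowel else 1.0
    return 1.0                        # everything else keeps its average
```

Reading the tree this way shows that stress only lengthens vowels outside clause edges, while at clause edges every segment of the syllable is lengthened.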

You may modify this adding more conditions as much as you want. In addition to 
the tree you need to define the averages for each phone in your phone set. For reasons 
we will explain below the format of this information is "segname 0.0 average" as in 

(set! simple_phone_data
'(
   (# 0.0 0.250)
   (a 0.0 0.080)
   (e 0.0 0.080)
   (i 0.0 0.070)
   (o 0.0 0.080)
   (u 0.0 0.070)
   (i0 0.0 0.040)
   ...
 ))

With both these expressions loaded in your voice you may set the following in
your voice definition function, setting up this tree and data as the standard and
selecting the appropriate duration module.

;; Duration prediction
(set! duration_cart_tree simple_dur_tree)
(set! duration_ph_info simple_phone_data)
(Parameter.set 'Duration_Method 'Tree_ZScores)

In your own voice, however, use voice specific names for the simple_ variables,
otherwise you may clash with other voices.

It has been shown [campbell91] that a better representation for duration for modeling
is zscores, that is, the number of standard deviations from the mean. The duration module
used in the above is actually designed to take a CART tree that returns zscores and
uses the information in duration_ph_info to change that into an absolute duration.
The two fields after the phone name are mean and standard deviation. The interpretation
of this tree and this phone info happens to give the right result when we use
the tree to predict factors and have the stddev field contain the average duration, as
we did above.
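The arithmetic behind this trick can be sketched in a few lines of Python. This is an illustration only, assuming the module converts a zscore with the usual formula mean + zscore * stddev; the names zscore_to_duration, predicted_duration and phone_info are hypothetical, and the two table entries are taken from the simple_phone_data example.

```python
def zscore_to_duration(zscore, mean, stddev):
    """Convert a zscore back to an absolute duration in seconds."""
    return mean + zscore * stddev


# "segname 0.0 average": the mean field is 0.0 and the stddev field
# holds the average duration, as in simple_phone_data.
phone_info = {"a": (0.0, 0.080), "#": (0.0, 0.250)}


def predicted_duration(phone, tree_value, info=phone_info):
    """Apply the value returned by the tree to the phone's table entry."""
    mean, stddev = info[phone]
    return zscore_to_duration(tree_value, mean, stddev)
```

With a mean of 0.0 the formula collapses to tree_value * average, so a tree that returns the factor 1.5 for the phone a yields 1.5 times its 80 ms average.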

However no matter if we use zscores or absolutes, a better way to build a duration
model is to train from data rather than arbitrarily selecting modification factors.

Given a reasonably sized database we can dump durations and features for each
segment in the database. Then we can train a model using those samples. For our
English voices we have trained regression models using wagon, though we include
the tools for linear regression models too.

An initial set of features to dump might be 

segment_duration
name
p.name
n.name
R:SylStructure.parent.syl_onsetsize
R:SylStructure.parent.syl_codasize
R:SylStructure.parent.R:Syllable.n.syl_onsetsize
R:SylStructure.parent.R:Syllable.p.syl_codasize
R:SylStructure.parent.position_type
R:SylStructure.parent.parent.word_numsyls
pos_in_syl
syl_initial
syl_final
R:SylStructure.parent.pos_in_word
p.seg_onsetcoda
seg_onsetcoda
n.seg_onsetcoda
pp.ph_vc
p.ph_vc
ph_vc
n.ph_vc
nn.ph_vc
pp.ph_vlng
p.ph_vlng
ph_vlng
n.ph_vlng
nn.ph_vlng
pp.ph_vheight
p.ph_vheight
ph_vheight
n.ph_vheight
nn.ph_vheight
pp.ph_vfront
p.ph_vfront
ph_vfront
n.ph_vfront
nn.ph_vfront
pp.ph_vrnd
p.ph_vrnd
ph_vrnd
n.ph_vrnd
nn.ph_vrnd
pp.ph_ctype
p.ph_ctype
ph_ctype
n.ph_ctype
nn.ph_ctype
pp.ph_cplace
p.ph_cplace
ph_cplace
n.ph_cplace
nn.ph_cplace
pp.ph_cvox
p.ph_cvox
ph_cvox
n.ph_cvox
nn.ph_cvox
R:SylStructure.parent.R:Syllable.pp.syl_break
R:SylStructure.parent.R:Syllable.p.syl_break
R:SylStructure.parent.syl_break
R:SylStructure.parent.R:Syllable.n.syl_break
R:SylStructure.parent.R:Syllable.nn.syl_break
R:SylStructure.parent.R:Syllable.pp.stress
R:SylStructure.parent.R:Syllable.p.stress
R:SylStructure.parent.stress
R:SylStructure.parent.R:Syllable.n.stress
R:SylStructure.parent.R:Syllable.nn.stress
R:SylStructure.parent.syl_in
R:SylStructure.parent.syl_out
R:SylStructure.parent.ssyl_in
R:SylStructure.parent.ssyl_out
R:SylStructure.parent.parent.gpos

By convention we build duration models in festival/dur/. We will save the above
feature names in dur.featnames. We can dump the features with the command



dumpfeats -relation Segment -feats dur.featnames -output dur.feats \ 
../utts/*.utt 



This will put all the features in the file dur.feats. For wagon we need to build a
feature description file; we can build a first approximation with the make_wagon_desc
script available with the speech tools



make_wagon_desc dur.feats dur.featnames dur.desc 

You will then need to edit dur.desc to change a number of
features from their categorical list (lots of numbers) into type float.
Specifically for the above list the features segment_duration,
R:SylStructure.parent.parent.word_numsyls, pos_in_syl,
R:SylStructure.parent.pos_in_word, R:SylStructure.parent.syl_in,
R:SylStructure.parent.syl_out, R:SylStructure.parent.ssyl_in and
R:SylStructure.parent.ssyl_out should be declared as floats.

We then need to split the data into training and test sets (and further split the
training set if we are going to use stepwise CART building).

traintest dur.feats
traintest dur.feats.train

We can now build a model using wagon

wagon -data dur.feats.train.train -desc dur.desc \
      -test dur.feats.train.test -stop 10 -stepwise \
      -output dur.10.tree
wagon_test -data dur.feats.test -tree dur.10.tree -desc dur.desc



You may wish to remove all examples of silence from the data as silence durations
typically have quite a different distribution from other phones. In fact it is common
that databases include many examples of silence which are not of natural length as
they are arbitrary parts of the initial and following silence around the spoken
utterances. Their durations are not something that should be trained for.

The instructions above will build a tree that predicts absolute values. To get such
a tree to work with the zscore module simply set the stddev field described above to 1
(leaving the mean field at 0.0). As stated above using zscores typically gives better
results. Although the correlation of these duration models in the zscore domain may
not be as good as that of models trained to predict absolute scores, when the predicted
zscores are converted back into the absolute domain we have found (for English) that
the correlations are better, and RMSE smaller.




In order to train a zscore model you need to convert the absolute segment durations
into zscores; to do that you need the means and standard deviations for each segment
in your phoneset.
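The conversion can be sketched as follows. This is a standalone Python illustration, not one of the festvox scripts: the function names are hypothetical, the input is assumed to be a list of (phone, duration) pairs, and the sample standard deviation is assumed, so a phone with a single example gets nan.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def phone_stats(samples):
    """Map each phone to (mean, stddev) over its observed durations."""
    by_phone = defaultdict(list)
    for phone, duration in samples:
        by_phone[phone].append(duration)
    stats = {}
    for phone, durations in by_phone.items():
        # The sample standard deviation is undefined for one example.
        sd = stdev(durations) if len(durations) > 1 else math.nan
        stats[phone] = (mean(durations), sd)
    return stats


def to_zscores(samples, stats):
    """Replace each absolute duration by its zscore for that phone.

    Phones whose stddev is nan or zero must be repaired by hand first.
    """
    return [(phone, (duration - stats[phone][0]) / stats[phone][1])
            for phone, duration in samples]
```

For example, the durations 0.06 and 0.10 for one phone give a mean of 0.08 and zscores of about -0.71 and +0.71.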

There is a whole branch of possible mappings for the distribution of durations:
zscores, logs, log-zscores, etc., or even more complex functions [bellegarda98]. These
variations do give some improvements. The intention is to map the distribution to a
normal distribution, which makes it easier to learn.

Other learning techniques are also possible, particularly the Sums of Products model
([sproat98] chapter 5), which has been shown to train better even on small amounts
of data.

Another technique, which although it shouldn't work, is to borrow models trained
for another language for which data is available. Actually the duration model used
in Festival for the US and UK voices is the same; it was in fact trained from the
f2b database, a US English database. As the phone sets are different for US and UK
English we trained the models using phonetic features rather than phone names,
and trained them in the zscore domain keeping the actual phone names and means
and standard deviations separate. Although the models were slightly better if we
included the phone names themselves, it was only slightly better and the models were
also substantially larger (and took longer to train). Using the phonetic features offers a
more general model (it works for UK English), more compact, quicker learning time
and with only a small cost in performance.

Also in the German voice developed at OGI, the same English duration model was 
used. The results are acceptable and are at least better than any hand written rule 
system that could be written. Improvements in that model are probably only possible 
by training on real German data. Note however such cross language borrowing of 
models is unlikely to work in general but there may be cases where it is a reasonable 
fall back position. 



Prosody Research

Note that the above descriptions are for the easy implementation of prosody models
which unfortunately means that the models will not be perfect. Of course no models
will be perfect but with some work it is often possible to improve the basic models
or at least make them more appropriate to the synthesis task. For example if the
intended use of your synthesis voice is primarily for dialog systems, training on news
caster speech will not give the best effect. Festival is designed as a research system as
well as a tool to build languages so it is well adapted to prosody research.

One thing which clearly shows off how impoverished our prosodic models are is
the comparison of predicted prosody with natural prosody. Given a label file and an
F0 Target file the following code will generate that utterance using the current voice

(define (resynth labfile f0file)
  (let ((utt (Utterance SegF0))) ; need some u to start with
    (utt.relation.load utt 'Segment labfile)
    (utt.relation.load utt 'Target f0file)
    (Wave_Synth utt))
)

The format of the label file should be one that can be read into Festival (e.g. the
XLabel format). For example



    0.02000 26    pau ;
    0.09000 26    ih ;
    0.17500 26    z ;
    0.22500 26    dh ;
    0.32500 26    ae ;
    0.35000 26    t ;
    0.44500 26    ow ;
    0.54000 26    k ;
    0.75500 26    ey ;
    0.79000 26    pau ;



The target file is a little more complex; again it is a label file but with features "pos"
and "F0" at each stage. Thus the format for a naturally rendered version of the above
would be




#
0.070000 124 0 ; pos 0.070000 ; f0 133.045230 ;
0.080000 124 0 ; pos 0.080000 ; f0 129.067890 ;
0.090000 124 0 ; pos 0.090000 ; f0 125.364600 ;
0.100000 124 0 ; pos 0.100000 ; f0 121.554800 ;
0.110000 124 0 ; pos 0.110000 ; f0 117.248260 ;
0.120000 124 0 ; pos 0.120000 ; f0 115.534490 ;
0.130000 124 0 ; pos 0.130000 ; f0 113.769620 ;
0.140000 124 0 ; pos 0.140000 ; f0 111.513180 ;
0.240000 124 0 ; pos 0.240000 ; f0 108.386380 ;
0.250000 124 0 ; pos 0.250000 ; f0 102.564100 ;
0.260000 124 0 ; pos 0.260000 ; f0 97.383600 ;
0.270000 124 0 ; pos 0.270000 ; f0 97.199710 ;
0.280000 124 0 ; pos 0.280000 ; f0 96.537280 ;
0.290000 124 0 ; pos 0.290000 ; f0 96.784970 ;
0.300000 124 0 ; pos 0.300000 ; f0 98.328150 ;
0.310000 124 0 ; pos 0.310000 ; f0 100.950830 ;
0.320000 124 0 ; pos 0.320000 ; f0 102.853580 ;
0.370000 124 0 ; pos 0.370000 ; f0 117.105770 ;
0.380000 124 0 ; pos 0.380000 ; f0 116.747730 ;
0.390000 124 0 ; pos 0.390000 ; f0 119.252310 ;
0.400000 124 0 ; pos 0.400000 ; f0 120.735070 ;
0.410000 124 0 ; pos 0.410000 ; f0 122.259190 ;
0.420000 124 0 ; pos 0.420000 ; f0 124.512020 ;
0.430000 124 0 ; pos 0.430000 ; f0 126.476430 ;
0.440000 124 0 ; pos 0.440000 ; f0 121.600880 ;
0.450000 124 0 ; pos 0.450000 ; f0 109.589040 ;
0.560000 124 0 ; pos 0.560000 ; f0 148.519490 ;
0.570000 124 0 ; pos 0.570000 ; f0 147.093260 ;
0.580000 124 0 ; pos 0.580000 ; f0 149.393750 ;
0.590000 124 0 ; pos 0.590000 ; f0 152.566530 ;
0.670000 124 0 ; pos 0.670000 ; f0 114.544910 ;
0.680000 124 0 ; pos 0.680000 ; f0 119.156750 ;
0.690000 124 0 ; pos 0.690000 ; f0 120.519990 ;
0.700000 124 0 ; pos 0.700000 ; f0 121.357320 ;
0.710000 124 0 ; pos 0.710000 ; f0 121.615970 ;
0.720000 124 0 ; pos 0.720000 ; f0 120.752700 ;
This file was generated from a waveform using the following command

pda -s 0.01 -otype ascii -fmax 160 -fmin 70 wav/utt003.wav |
awk 'BEGIN { printf("#\n") }
     { if ($1 > 0)
         printf("%f 124 0 ; pos %f ; f0 %f ; \n",
                NR*0.010,NR*0.010,$1) }' >Targets/utt003.Target
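For readers less familiar with awk, the same formatting step can be sketched in Python. This is only an illustration: target_lines is a hypothetical name, and the input is assumed to be the list of per-frame F0 values that pda prints at a 10 ms frame shift, with 0 marking unvoiced frames.

```python
def target_lines(f0_values, frame_shift=0.010):
    """Format per-frame F0 values as lines of a Target label file.

    Frames are numbered from 1 (like awk's NR) and unvoiced frames,
    whose F0 is 0, are skipped.
    """
    lines = ["#"]
    for frame_number, f0 in enumerate(f0_values, start=1):
        if f0 > 0:
            time = frame_number * frame_shift
            lines.append("%f 124 0 ; pos %f ; f0 %f ; " % (time, time, f0))
    return lines
```

Skipping the unvoiced frames is what produces the gaps in the time column of the example file above.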



The utterance may then be rendered as

festival> (set! utt1 (resynth "lab/utt003.lab" "Targets/utt003.Target"))



Note that this method will lose a little in diphone selection. If your diphone database
uses consonant cluster allophones it won't be possible to properly detect these as there
is no syllabic structure in this. That may or may not be important to you. Even this
simple method however clearly shows how important the right prosody is to the
understandability of a string of phones.

We have successfully done this on a number of natural utterances. We extracted the
labels automatically by using the aligner discussed in the diphone chapter. As we
were using diphones from the same speaker as the natural utterances (KAL) the
alignment is surprisingly good and trivial to do. You must however synthesize the
utterance first and save the waveform and labels. Note you should listen to ensure
that the synthesizer has generated the right labels (as much as that is possible),
including breaks in the same places. Comparing synthesized utterances with natural
ones quickly shows up many problems in synthesis.

Prosody Walkthrough 

This section gives a walkthrough of a set of basic scripts that can be used to build
duration and F0 models. The results will be reasonable but they are designed to be
language independent and hence more appropriate models will almost certainly give
better results. We have used these methods when building diphone voices for new
languages when we know almost nothing explicit about the language structure. This
walkthrough however explicitly covers most of the major steps and hence will be
useful as a basis for building new better models.

In many ways this process is similar to the limited domain voice building process.
Here we will design a set of prompts which are believed to cover the prosody that
we wish to model, then record and label the data, and then build models from the
utterances built from the natural speech. In fact the basic structure for this uses the
limited domain scripts for the initial part of the process.

The basic stages of this task are

• Design database
• Setup directory structure
• Synthesizing prompts (for labeling)
• Recording prompts
• Phonetically label prompts
• Extract pitchmarks and F0 contour
• Build utterance structures
• For duration models
    • extract means and standard deviations of phone durations
    • extract features for prediction
    • build feature description files
    • Build regression model to predict durations
    • construct scheme file with duration model
• For F0 models
    • extract features for prediction
    • build feature description files
    • Build regression model to predict F0 at start, mid and end of syllable
    • construct scheme file with F0 model

Design database 

The object here is to capture enough speech in the prosodic style that you wish your
synthesizer to use. Note that as prosodic modeling is still an extremely difficult area all
models are extremely impoverished (especially the very simple models we are
presenting here), thus do not be too ambitious. However it is worthwhile considering if
you wish dialog (i.e. conversational speech) or prose (i.e. read speech). Prose can be
news reader style or story telling style. Most synthesizers are trained on news reader
style because it is fairly consistent and believed to be easier to model, and reading
paragraphs of text is seen as a basic application for text to speech synthesizers. However
today with more dialog systems such prosodic models are often not as appropriate.

Ideally your database will be marked up with prosodic tagging that your voice talent
will understand and be able to deliver appropriately. Designing such a database isn't
easy but when starting off in new languages anything may be better than fixed
durations and a naive declining F0. Thus simply a list of 500 sentences from newspapers
may give rise to better models than these defaults.

Suppose you have your 500 sentences; construct a prompt list as is done with the
limited domain construction. That is, you need a file of the form

( sent_0001 "She had your dark suit in greasy washwater all year.") 
( sent_0002 "Don't make me carry an oily rag like that.") 
( sent_0003 "They wanted to go on a barge trip.") 



Setup directory structure 



As with the rest of the festvox tools, you need to set the following two environment
variables for them to work properly. In bash or other Bourne shell compatibles
type, with the appropriate pathnames for your installation of the Edinburgh Speech
Tools and Festvox itself

export FESTVOXDIR=/home/awb/projects/festvox
export ESTDIR=/home/awb/projects/1.4.1/speech_tools

For csh and its derivatives you should type

setenv FESTVOXDIR /home/awb/projects/festvox
setenv ESTDIR /home/awb/projects/1.4.1/speech_tools

As the basic structure is so similar to the limited domain building structure, first you
should do all of that setup procedure. If you are building prosodic models for an already
existing limited domain then you do not need this part.

mkdir cmu_timit_awb
cd cmu_timit_awb
$FESTVOXDIR/src/ldom/setup_ldom cmu timit awb

The arguments are institution, domain type, and speaker name.

After setting this up you need to also set up the extra directories and scripts needed to
build prosody models. This is done by the command

$FESTVOXDIR/src/prosody/setup_prosody

You should copy your database files as created in the previous section into etc/.

Synthesizing prompts

We then synthesize the prompts. As we are trying to collect natural speech these
prompts should not normally be presented to the voice talent as they may then copy
the synthesizer intonation, which would almost certainly be a bad thing. As this will
sometimes be the first serious use of a new diphone synthesizer in a new language
(with impoverished prosody models) it is important to check that the prompts can be
generated phonetically correctly. This may require more additions to the lexicon and/or
more token to word rules. We synthesize the prompts for two reasons. First, to use
for autolabeling, in that the synthesized prompts will be aligned using DTW against
what the speaker actually says. Second, we are trying to construct Festival utterance
structures for each utterance in this database with natural durations and F0, so we
may learn from them.

You should change the line setting the "closest" voice

(set! cmu_timit_awb::closest_voice 'voice_kal_diphone)

This is in the file festvox/cmu_timit_awb_ldom.scm. This is the voice that will be
used to synthesize the prompts. Often this will be your new diphone voice.

Ideally we would like these utterances to also have natural phone sequences, such
that schwas, allophones such as flaps, and post-lexical rules have been applied. At
present we do not include that here though for more serious prosody modeling such
phenomena should be included in the utterance structures here.

The prompts can be synthesized by the command

festival -b festvox/build_ldom.scm '(build_prompts "etc/timit.data")'

Recording the prompts

The usual caveats apply to recording (the Section called Recording under Unix in Chapter 4)
and the issues on selecting a speaker.

As prosody modeling is difficult, and if you are inexperienced in building such models,
it is wise not to attempt anything hard. Just building reliable models for default
unmarked intonation is very useful if your current models are simply the default
fixed intonation. Thus the sentences should be read in a natural but not too varied
style.

Recording can be done with pointyclicky or prompt_them. If you are using
prompt_them, you should modify that script so that it does not play the prompts, as
they will confuse the speaker. The speaker should simply read the text (and markup,
if present).

pointyclicky etc/timit.data

or

bin/prompt_them etc/timit.data



Phonetically label prompts 

After recording, the spoken utterances must be labeled

bin/make_labs prompt-wav/*.wav

This is one of the computationally expensive parts of the process and for longer
sentences it can require much memory too.

After autolabeling it is always worthwhile to inspect the labels and correct mistakes.
Phrasing can particularly cause problems so adding or deleting silences can make the
derived prosody models much more accurate. You can use emulabel to do this.

emulabel etc/emu_lab



Extract pitchmarks and F0

At this point we diverge from the process used for building limited domain
synthesizers. You can construct such synthesizers from the same recordings; maybe you
wish more appropriate prosodic models for the fallback synthesizer. But at this point
we need to extract the pitchmarks in a slightly different way. We are intending to extract
F0 contours for all non-silence parts of the speech signal. We do this by extracting
pitchmarks for the voiced sections alone then (in the next section) interpolating the
F0 through the non-voiced (but non-silence) sections.

The Section called Extracting pitchmarks from waveforms in Chapter 4 discusses the
setting of parameters to get bin/make_pm_wave to work for a particular voice. In this
case we need those same parameters (which should be found by experiment). These
should be copied from bin/make_pm_wave and added to bin/make_f0_pm in the variable
PM_ARGS. The distribution contains something like

PM_ARGS='-min 0.0057 -max 0.012 -def 0.01 -wave_end -lx_lf 140 -lx_lo 111 -lx_hf 80 -lx_ho 51 -med_o 0'

Importantly this differs from the parameters in bin/make_pm_wave as we do not
use the -fill option to fill in pitchmarks over the rest of the waveform.

The second part of this section is the construction of an F0 contour which is built
from the extracted pitchmarks. Unvoiced speech sections are assigned an F0 contour
by interpolation from the voiced sections around them, and the result is smoothed.
The label files are used to define which parts of the signal are silence and which are
speech.
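The idea of filling unvoiced speech by interpolation and then smoothing can be sketched as below. This is a Python illustration under stated assumptions and not the method bin/make_f0_pm actually uses: linear interpolation and a short moving average are assumed, both function names are hypothetical, and the input is a per-frame F0 list (0 for unvoiced) with a parallel list saying whether each frame is speech.

```python
def interpolate_unvoiced(f0, is_speech):
    """Fill unvoiced speech frames by linear interpolation.

    Silence frames (is_speech False) are left at 0.
    """
    out = list(f0)
    voiced = [i for i, value in enumerate(f0) if value > 0]
    for i, value in enumerate(f0):
        if value > 0 or not is_speech[i]:
            continue
        prev = max((j for j in voiced if j < i), default=None)
        nxt = min((j for j in voiced if j > i), default=None)
        if prev is None and nxt is None:
            continue                      # nothing voiced to borrow from
        if prev is None:
            out[i] = f0[nxt]              # before the first voiced frame
        elif nxt is None:
            out[i] = f0[prev]             # after the last voiced frame
        else:
            weight = (i - prev) / (nxt - prev)
            out[i] = f0[prev] + weight * (f0[nxt] - f0[prev])
    return out


def smooth(values, width=3):
    """Moving average over non-zero neighbours; zeros stay zero."""
    half = width // 2
    out = []
    for i, value in enumerate(values):
        if value <= 0:
            out.append(value)
            continue
        window = [v for v in values[max(0, i - half):i + half + 1] if v > 0]
        out.append(sum(window) / len(window))
    return out
```

A contour of 100 Hz and 130 Hz separated by two unvoiced speech frames is filled with 110 Hz and 120 Hz, while a trailing silence frame stays at 0.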

The variable silence in bin/make_f0_pm must be modified to reflect the symbol
used for silence in your phoneset.

Once the pitchmark parameters have been determined, and the appropriate silence
value set, you can extract the smoothed F0 by the command

bin/make_f0_pm wav/*.wav

You can view the F0 contours with the command

emulabel etc/emu_f0

Build utterance structures

With the labels and F0 created we can now rebuild the utterance structures by
synthesizing the prompts and merging in the natural durations and F0
from the naturally spoken utterances.



festival -b festvox/build_ldom.scm '(build_utts "etc/timit.data")' 



Duration models 






The script bin/make_dur_model contains all of the following commands but it is wise 
to understand the stages as due to errors in labeling it may not all run completely 
smoothly and small fixes may be required. 






We are building a duration model using a CART tree to predict zscore values for 
phones. Zscores (number of standard deviations from the mean) have often been 
used in duration modeling as they allow a certain amount of normalization over 
different phones. 

You should first look at the script bin/make_dur_model and edit the following three
variable values

SILENCENAME=SIL
VOICENAME='(kal_diphone)'
MODELNAME=cmu_us_kal

These should contain the name for silence in your phoneset, the call for the voice you
are building this model for (or at least one that uses the same phoneset), and finally
the name for the model, which can be the same inst_lang_vox part of the voice you
call.

The first stage is to find the means and standard deviations for each phone. A Festival
script in the Festival distribution is used to load in all the utterances and calculate
these values, with the command

durmeanstd -output festival/dur/etc/durs.meanstd festival/utts/*.utt

You should check festival/dur/etc/durs.meanstd, the generated file, to ensure
that the numbers look reasonable. If there is only one example of a particular phone,
the standard deviation cannot be calculated and the value is given as nan
(not-a-number). This must be changed to a standard numeric value (say one-third of the
mean). Also some of the values in this table may be adversely affected by bad labeling
so you may wish to hand modify the values, or go back and correct the labeling.
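The nan repair described above is simple enough to sketch. The following Python fragment is an illustration only: repair_stddevs is a hypothetical helper working on an in-memory table of phone to (mean, stddev), not on the durs.meanstd file itself, whose exact layout is not assumed here.

```python
import math


def repair_stddevs(stats, fraction=1.0 / 3.0):
    """Replace nan standard deviations by a fraction of the mean."""
    repaired = {}
    for phone, (mean, stddev) in stats.items():
        if math.isnan(stddev):
            # Only one example of this phone: fall back to mean / 3.
            stddev = mean * fraction
        repaired[phone] = (mean, stddev)
    return repaired
```

A phone seen once with a mean of 0.09 seconds would therefore be given a standard deviation of 0.03 seconds, and all other entries are left untouched.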

The next stage is to extract the features from which we will predict the durations. The
list of features extracted is in festival/dur/etc/dur.feats. These cover phonetic
context, syllable, word position etc. These may or may not be appropriate for your
new language or domain and you may wish to add to these before doing the
extraction. The extraction process takes each phoneme and dumps the named feature
values for that phone into a file. This uses the standard Festival script dumpfeats.
The command looks like

$DUMPFEATS -relation Segment -eval $VOICENAME \
      -feats festival/dur/etc/dur.feats \
      -output festival/dur/feats/%s.feats \
      -eval festival/dur/etc/logdurn.scm \
      festival/utts/*.utt

These feature files are then concatenated into a single file which is then split (90/10)
into training and test sets. The training set is further split for use as a held-out test set
used in the training phase. Also at this stage we remove all silence phones from the
training and test set. This is, perhaps naively, because the distribution of silences is
very wide and often files contain silences at the start and end of utterances which
themselves aren't part of the speech content (they're just the edges) and having these
in the training set can skew the results.

This is done by the commands

cat festival/dur/feats/*.feats | \
   awk '{if ($2 != "'"$SILENCENAME"'") print $0}' >festival/dur/data/dur.data
bin/traintest festival/dur/data/dur.data
bin/traintest festival/dur/data/dur.data.train
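The filtering and the two-level split can be sketched in Python as follows. This is an illustration under assumptions rather than a description of the traintest script: both function names are hypothetical, each row is assumed to be a whitespace separated line whose second field is the phone name, and the 90/10 split is assumed to put every tenth row in the test set.

```python
def remove_silence(rows, silence_name):
    """Drop rows whose second field (the phone name) is the silence phone."""
    return [row for row in rows if row.split()[1] != silence_name]


def train_test_split(rows, every=10):
    """Put every tenth row (counting from 1) in the test set."""
    train = [row for i, row in enumerate(rows, start=1) if i % every != 0]
    test = [row for i, row in enumerate(rows, start=1) if i % every == 0]
    return train, test


def three_way_split(rows, silence_name):
    """Return (train.train, train.test, test) as used for stepwise wagon."""
    speech = remove_silence(rows, silence_name)
    train, test = train_test_split(speech)
    train_train, train_test = train_test_split(train)
    return train_train, train_test, test
```

The inner split gives wagon a held-out set to score candidate features against during stepwise training, while the outer test set stays untouched for the final evaluation.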



For wagon, the CART tree builder, to work it needs to know what possible values each
feature can take. This can mostly be determined automatically but some features may
have values that could be either numeric or classes, thus we use a post-processing
function on the automatically generated description file to get our desired result.

$ESTDIR/bin/make_wagon_desc festival/dur/data/dur.data \
      festival/dur/etc/dur.feats festival/dur/etc/dur.desc
festival -b --heap 2000000 festvox/build_prosody.scm \
      $VOICENAME '(build_dur_feats_desc)'



Now we can build the model itself. A key factor in the time this takes (and the
accuracy of the model) is the "stop" value, that is the number of examples that must
exist before a split is searched for. The smaller this number the longer the search will
be, though up to a certain point the more accurate the model will be. But at some
level this will over train. The default in the distribution is 50 which may or may not
be appropriate. Note that for large databases and for smaller values of stop the training
may take days even on a fast processor.

Although we have guessed a reasonable value for this for databases of around 50-1000
utterances it may not be appropriate for you.

The learning technique used is basic CART tree growing but with an important
extension which makes the process much more robust on unseen data but unfortunately
much more computationally expensive. The -stepwise option on wagon incrementally
searches for the best features to use in building the tree, in addition to finding at each
iteration the questions about each feature that best model the data. If you
want a quicker result removing the -stepwise option will give you that.

The basic wagon command is

wagon -data festival/dur/data/dur.data.train.train \
      -desc festival/dur/etc/dur.desc \
      -test festival/dur/data/dur.data.train.test \
      -stop $STOP \
      -output festival/dur/tree/$PREF.S$STOP.tree \
      -stepwise

To test the results on data not used in the training we use the command

wagon_test -heap 2000000 -data festival/dur/data/dur.data.test \
      -desc festival/dur/etc/dur.desc \
      -tree festival/dur/tree/$PREF.S$STOP.tree

Interpreting the results isn't easy in isolation. The smaller the RMSE (root mean
squared error) the better and the larger the correlation the better (it should never
be greater than 1, and should never be below 0, though if your model is very bad it
can be below 0). For English, with this script on a Timit database we get an RMSE
value of 0.81 and correlation of 0.58 on the test data. Note these values are not in the
absolute domain (i.e. seconds); they are in the zscore domain.
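For reference, the two figures can be computed as in the following Python sketch. It is not the wagon_test implementation: the function names are hypothetical, the inputs are assumed to be equal-length lists of predicted and actual values (zscores here), and the correlation is the standard Pearson coefficient.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predictions and actual values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))


def correlation(predicted, actual):
    """Pearson correlation between predictions and actual values."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)
```

A model that always predicts the mean zscore of 0 would score an RMSE close to 1 in the zscore domain, which puts the reported 0.81 in context.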

In the final stage, probably after a number of iterations of the build process, we must
package the model into a scheme file that can be used with a voice. This scheme file
contains the means and standard deviations (so we can convert the predicted values
back into seconds) and the prediction tree itself. We also add in predictions for the
silence phone by hand. The command to generate this is

festival -b --heap 2000000 \
      festvox/build_prosody.scm $VOICENAME \
      '(finalize_dur_model "'$MODELNAME'" "'$PREF.S$STOP.tree'")'

This will generate a file festvox/cmu_timit_awb_dur.scm. If your model name is the
same as the basic diphone voice you intend to use it in you can simply copy this file
to the festvox/ directory of your diphone voice and it will automatically work. But
it is worth explaining what this install process really is. The duration model scheme
file contains two lisp expressions setting the variables MODELNAME::phone_durs
and MODELNAME::zdurtree. To use these in a voice you must load this file, typically
by adding

(require 'MODELNAME_dur)

to the diphone voice definition file (festvox/MODELNAME_diphone.scm), and then
get the voice definition to use these new variables. This is done by the commands in
the voice definition function

;; Duration prediction
(set! duration_cart_tree MODELNAME::zdurtree)
(set! duration_ph_info MODELNAME::phone_durs)
(Parameter.set 'Duration_Method 'Tree_ZScores)

FO contour models v^> 

(what about accents ?) y** 

extract features for prediction feature description files Build regression model 

to preict FO at start, mid and end ofr pliable construct scheme file with FO model 




Chapter 9. Corpus development 



This chapter discusses the techniques used to design a good corpus for recording for 
use in general speech synthesis. The basic requirements for a speech synthesis corpus 
are: 



• Phonetically and prosodically balanced 

• Targeted toward the intended domain(s) 

• Easy to say by voice talent without mistakes

• Short enough for the voice talent to be willing to say it.

To make life easier in designing a prompt list we have included scripts which go some
way to aid the process. The general idea is to take a very large amount of text and
automatically find "nice" utterances that match these criteria.



The CMU ARCTIC database prompt list [kominek 2003] was created very much in
this way, though with an earlier version of the scripts.

As with most of our work there is a single script that does a number of basic stages.
The default is reasonable in many cases, but with prompt selection it is always worth
hand checking and potentially modifying and refining the process.

The basic idea is to limit the chosen utterances to those of a reasonable length (5-15
words), to choose only sentences with high frequency words (which should be easier to
say and less ambiguous in pronunciation), and also to restrict to words that are in the
lexicon (avoiding letter-to-sound issues).

The script make_nice_prompts is set up for two classes of language, Latin-script
languages and non-Latin-script languages. As the non-Latin case is much
more varied you may need to modify things. We have successfully used it for UTF-8
encoded Hindi.

For the Latin-script languages (as opposed to the "asis" cases) we downcase the text
when looking for variation. Although some Latin-based languages make a significant case
distinction, e.g. German, this is a reasonable route to avoid sentences with too many
repeated words in them.

First gather lots of text data. When we say lots we mean at least millions of words,
or even tens of millions of words. This basic selection process is aimed at getting
sentences for general voices and hence as large an amount of starting data as possible
is important.

Please also note the copyright of the data you are selecting from. In CMU ARCTIC we
used out-of-copyright texts from the Gutenberg project so there would be no issue
in distributing the data. Copyright law in many countries allows the use of small
subsets of copyrighted data, but the extent of this fair use is often disputed, and
there may not actually be a good solution to this. News stories are typically
copyrighted by the press agency releasing them. Licenses on LDC data are often
sufficient to allow using such texts to build prompts, with no restriction on the
voices generated, but the database itself may be under question. If you care about
distribution (free or for sale) you will need to address these issues.

The first stage once you have collected your data is to check its encoding. Make sure
it is all in the same encoding. Also check that it is reasonable. For example the
Europarl data is nice and clean (as conditioned for Machine Translation models) but the
punctuation has been separated from the words. You may want to de-htmlify your data
before passing it to the selection routines.

Once you have a set of nice relatively clean data, you use the first stage of the script. 
This finds the word frequencies of all the tokens in the text data. 

$FESTVOXDIR/src/promptselect/make_nice_prompts find_freq TEXT0 TEXT1 TEXT2






Give all the text files as arguments to this script. The working files will be created in 
the current directory, but the text file arguments may be pathnames. 
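As an illustration of what a word-frequency pass involves, here is a minimal Python sketch. It is not the make_nice_prompts implementation: the function names are hypothetical, tokens are split on whitespace only (so punctuation stays attached), and the real script's tokenization may well differ.

```python
from collections import Counter


def count_tokens(lines, downcase=True):
    """Count whitespace-separated tokens over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        if downcase:
            line = line.lower()
        counts.update(line.split())
    return counts


def count_files(paths, encoding="utf-8", downcase=True):
    """Accumulate token counts over several text files."""
    total = Counter()
    for path in paths:
        with open(path, encoding=encoding) as handle:
            total.update(count_tokens(handle, downcase=downcase))
    return total
```

Calling `count_files(["TEXT0", "TEXT1"]).most_common(5000)` would then give a frequency-ordered word list of the kind the next stage works from. Passing `downcase=False` keeps the text as is.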

The next stage is to build a Festival lexicon for the most frequent words. By default we 
select the top 5000 words, which has proven a reasonable choice. You can optionally 
override the 5000 with an argument. 



$FESTVOXDIR/src/promptselect/make_nice_prompts make_freq_lex



The next stage processes each sentence in the large text database to find those
utterances that are "nice": that is, of reasonable length, containing only words in the
frequency lexicon, with no strange punctuation, a capital at the beginning, punctuation
at the end, and satisfying a few other heuristic conditions. These seem to work well for
the Latin-script languages (though it is possible the conditions are overly strict for
some languages).

Although Festival's text front end is used for processing the text, you do not need
to build any language specific text front end (at least not normally). Finding nice
prompts is considered one of the very first parts of building support for a new
language, so we are aware that there will be almost no resources available for the
target language yet.

Because this process is using Festival's front end, it is not fast, as it needs to process
the whole text database. It is not unusual for this to take a number of hours.
While processing, "nice" utterances are written to data_nice.data. You should check
this regularly in case there is some inappropriate condition in the rules and you are
getting the wrong type of data.

$FESTVOXDIR/src/promptselect/make_nice_prompts find_nice TEXT0 TEXT1 ...

Note this will only search for the first 100,000 nice utterances from the data set; you
can change the number in the script if you want more (or fewer).
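The "nice" conditions described above can be sketched as a simple filter. The following Python is only an approximation under stated assumptions: the function names are hypothetical, the length limits follow the 5-15 word range mentioned earlier, and the script's actual heuristics (which run through Festival's front end) include further rules not shown here.

```python
import string
from collections import Counter


def top_words(counts, n=5000):
    """Return the n most frequent words as a set (the frequency lexicon)."""
    return {word for word, _ in Counter(counts).most_common(n)}


def is_nice(sentence, vocab, min_words=5, max_words=15):
    """Apply simple "nice utterance" heuristics to one sentence."""
    sentence = sentence.strip()
    words = sentence.split()
    if not words or not (min_words <= len(words) <= max_words):
        return False
    # Capital at the beginning, sentence punctuation at the end.
    if not sentence[0].isupper() or sentence[-1] not in ".?!":
        return False
    for word in words:
        core = word.strip(string.punctuation)
        # Reject digits, word-internal punctuation and out-of-lexicon words.
        if not core.isalpha() or core.lower() not in vocab:
            return False
    return True
```

This version is deliberately strict: a word such as "don't" is rejected because of its internal apostrophe. Loosening or tightening such rules is exactly the kind of hand tuning the text recommends.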

Once the "nice" utterances are found you can find those nice utterances that have
the best phonetic coverage. There are two mechanisms available here. Because this is
often the very first stage in building support for a new language, no lexicon or phone
set is available, thus selecting based on phonetics is not an option. Therefore we
provide a simpler technique that selects based on letter coverage (in fact di-letter
coverage). This is often a reasonable solution, but whether it is will depend on the
language. Note that for English, in spite of its somewhat poor relationship between
orthography and pronunciation, this is reasonable, so don't exclude this as a
possibility without trying it.

Letter selection will find the subset of the nice utterances that has the best letter
coverage. It is a greedy algorithm, but this is usually sufficient. This process also
takes a number to define the number of utterances it is looking for. The process will be
applied multiple times to the remaining data until that number is reached. If there
isn't enough data to select from it might loop for ever. By default it looks for 1000
utterances, which is not unreasonable for a unit selection voice; 500 is probably
sufficient for a CLUSTERGEN voice. But, as they say, your mileage may vary.


$FESTVOXDIR/src/promptselect/make_nice_prompts select_letter_n
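The greedy, repeated-pass selection described above can be sketched in a few lines of Python. This is a hedged illustration, not the script's code: the function names are hypothetical, di-letters are taken within words only, and unlike the behaviour warned about above this sketch stops when a pass adds nothing rather than looping.

```python
def diletters(sentence):
    """Set of adjacent letter pairs found within the words of a sentence."""
    pairs = set()
    for word in sentence.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        pairs.update(a + b for a, b in zip(letters, letters[1:]))
    return pairs


def select_by_coverage(sentences, target, units=diletters):
    """Greedily pick sentences adding the most uncovered units.

    When no remaining sentence adds anything new, coverage is reset and
    another pass is made over the remaining data, until `target`
    sentences have been selected or no progress is possible.
    """
    pool = [(s, units(s)) for s in sentences]
    selected = []
    while pool and len(selected) < target:
        covered = set()
        picked_this_pass = 0
        while pool and len(selected) < target:
            best = max(range(len(pool)), key=lambda i: len(pool[i][1] - covered))
            sentence, unit_set = pool[best]
            if not unit_set - covered:
                break  # nothing new in this pass; start another
            covered |= unit_set
            selected.append(sentence)
            del pool[best]
            picked_this_pass += 1
        if picked_this_pass == 0:
            break  # remaining sentences contribute no units at all
    return selected
```

Because `units` is a parameter, the same loop can be reused with a different unit extractor, for example one that returns diphones instead of di-letters.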


If you do have a pronunciation lexicon for your language, you can also select based
on segments rather than letters. We have not done exhaustive comparisons of how
valuable segment selection is over letter selection. It is clear that although probably
important, it is probably less important than selecting a good voice talent, or recording
the prompts in a high quality manner. Two stages are required for segment selection.
The first stage is to render the nice prompts from words to segments






$FESTVOXDIR/src/promptselect/make_nice_prompts synth_seg

Then greedily select the utterances with the best diphone coverage.

$FESTVOXDIR/src/promptselect/make_nice_prompts select_seg

We do not yet support select_seg_n. 
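To make the two segment stages concrete, here is a small Python sketch of rendering a prompt to phones and collecting its diphones. The lexicon format (a plain dictionary from word to phone list), the "pau" silence padding and the function names are assumptions made for illustration; the real script uses Festival and your language's lexicon.

```python
def sentence_to_phones(sentence, lexicon, silence="pau"):
    """Render a sentence as a phone list, or None if a word is unknown."""
    phones = [silence]
    for word in sentence.lower().split():
        word = word.strip(".,;:!?\"'")
        if word not in lexicon:
            return None
        phones.extend(lexicon[word])
    phones.append(silence)
    return phones


def diphones(phones):
    """Set of adjacent phone pairs, including pairs across word boundaries."""
    return {f"{a}-{b}" for a, b in zip(phones, phones[1:])}
```

The resulting diphone sets can be scored greedily in the same way as di-letters, picking at each step the utterance that adds the most uncovered diphones.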

After selection the nice prompts will be in data.done.data. Look at it. Do not
expect it to be perfect. I have never done this for a new language without having to do
it multiple times until I get something reasonable. Even once you have the result, it is
worthwhile checking each utterance and correcting and/or rejecting it for other
reasons, such as being ungrammatical, hard to read, or containing ambiguous words.
Be bold and get rid of weird sentences; it will save you trouble later. The selection
process is deliberately designed to have redundancy, as speech is a variable medium
and we can never be sure exactly what the voice talent will say, or how the unit
selection process will select the units from the database.

It is wise to first go through every sentence and attempt to record it, and at that time
decide if the sentence is actually a reasonable utterance to include in the prompt set
for that language.

The final stage extracts the vocabulary of the selected prompt set. You can use this
vocab list to start building your pronunciation lexicon, as you will need that to build
your speech databases (unless you are using an orthography based selection
technique).

$FESTVOXDIR/src/promptselect/make_nice_prompts find_vocab
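Extracting a vocabulary from the selected prompts amounts to collecting the distinct words. The Python sketch below assumes the prompts are available as plain strings and uses a hypothetical function name; it strips only ASCII punctuation, so scripts with their own punctuation marks would need extra handling.

```python
import string


def find_vocab(prompts, downcase=True):
    """Return the sorted list of distinct words used in the prompts."""
    vocab = set()
    for prompt in prompts:
        for token in prompt.split():
            word = token.strip(string.punctuation)
            if downcase:
                word = word.lower()
            if word:
                vocab.add(word)
    return sorted(vocab)
```

Each word in the returned list is a candidate entry for the pronunciation lexicon. Use `downcase=False` to keep the text as is.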

You can also do the whole process with one command

$FESTVOXDIR/src/promptselect/make_nice_prompts do_all TEXT0 TEXT1 ...

As the find_nice stage really takes up about 98% of the processing time, redoing the
other parts each time isn't unreasonable.

Non-Latin-script languages

For non-Latin-script languages there are options that seem to work well, if the
language has spaces between words. We have used this quite extensively for UTF-8
encoded languages (Arabic and Hindi). For these languages use

$FESTVOXDIR/src/promptselect/make_nice_prompts select_seg
$FESTVOXDIR/src/promptselect/make_nice_prompts find_freq_asis TEXT0 TEXT1 ...

$FESTVOXDIR/src/promptselect/make_nice_prompts make_freq_lex
$FESTVOXDIR/src/promptselect/make_nice_prompts find_nice_asis TEXT0 TEXT1 ...

$FESTVOXDIR/src/promptselect/make_nice_prompts select_letter_n

$FESTVOXDIR/src/promptselect/make_nice_prompts find_vocab_asis



You can also do the whole process with the command 



$FESTVOXDIR/src/promptselect/make_nice_prompts do_all_asis TEXT0 TEXT1 ...



For languages that do not have spaces between the words (Chinese, Japanese, Thai 
etc), the above techniques will not work. We have used the above techniques for 
Chinese, by first segmenting the data into words. 






Chapter 10. Waveform Synthesis 



This part of the book discusses the techniques required to actually create the speech 
waveform from a complete phonetic and prosodic description. 

Traditionally we can identify three different methods that are used for creating wave- 
forms from such descriptions. 

Articulatory synthesis is a method where the human vocal tract, tongue, lips, etc. are
modeled in a computer. This method can produce very natural sounding samples of
speech but at present it requires too much computational power to be used in practical
systems.

Formant synthesis is a technique where the speech signal is broken down into
overlapping parts, such as formants (hence the name), voicing, aspiration, nasality etc.
The outputs of individual generators are then composed from streams of parameters
describing each sub-part of the speech. With carefully tuned parameters the quality
of the speech can be close to that of natural speech, showing that this encoding is
sufficient to represent human speech. However automatic prediction of these
parameters is harder, and the typical output of synthesizers based on this technology
can be very understandable but still have a distinct non-human or robotic quality.
Formant synthesizers are best typified by MITalk, the work of Dennis Klatt, and what
was later commercialised as DECTalk. DECTalk is at present probably the most familiar
form of speech synthesis to the world. Dr Stephen Hawking, Lucasian Professor of
Mathematics at Cambridge University, has lost the use of his voice and uses a formant
synthesizer for his speech.

Concatenative synthesis is the third form, which we will spend the most time on. Here
waveforms are created by concatenating parts of natural speech recorded from humans.

In the search for more natural synthesis there has been a move to use recorded
human speech rather than techniques to construct it from its fundamental parts. The
potential for using recorded speech has in some sense been made possible by the
increase in power of computers and storage media. Formant synthesizers require a
relatively small amount of data but concatenative speech synthesizers typically require
much more disk space to contain the inventory of speech sounds. More recently
the size of the databases used has grown further, as there is some relationship between
voice quality and database size.

The techniques described here and in the following chapters are concerned solely with
concatenative synthesis. Concatenative synthesis techniques not only give the most
natural sounding speech synthesis, they are also the most accessible to the general user
in that it is quite easy for us to record speech and the techniques used here to analyse
it are, for the most part, automatic.

The area of concatenative systems can be viewed as a complex continuum, as there
are many choices in selecting unit type, size etc. We have included examples and
techniques from the most conservative (i.e. most likely to work) to the forefront of
the art of voice building. It is important to understand the space of possible
synthesis techniques in order to select the best one for your particular application.
The resources required to develop each of the basic types of concatenative synthesizer
vary greatly. It is possible to get a general speech synthesizer working in English
in under an hour, though its quality isn't very good (though it can be understandable).
At the other extreme, in the area of speech synthesis we have yet to develop a
system that is both natural and flexible enough to satisfy all synthesis requirements,
so the task of building a voice may take a lifetime. However the following chapters
outline techniques which can be completed by an interested person in as little as a day,
or at most a week, and which can produce high quality, natural sounding voices suitable
for a wide range of computer speech applications.

In order to synthesize any piece of text we need to have examples of every unit
in the language to be synthesized. At some extreme this would mean you'd need
recordings of every sentence (or even paragraph) of everything that needs to be said.
This of course is impractical and defeats the whole purpose of having a synthesizer.
Thus we need to make some simplifying assumptions. The simplest (and most
extreme) is to assume that speech is made up of strings of discrete phonemes. US
English has (by one definition) 43 different phonemes, that is, the fundamental sounds
in the language. Thus the word "bit" is made up of the three phonemes /B/ /IH/ and /T/.
The word "beat" however is made up of the phonemes /B/ /IY/ and /T/.

In the following chapter we consider the absolute simplest waveform synthesizer,
which consists of recording each phone in the language and then resequencing them to
form new words. Although this is an easy and quick synthesizer to build it is
immediately obvious that it is not of very good quality. The reason is that human speech
just doesn't consist of isolated phonemes concatenated together; there are
articulatory effects that cross over multiple phones (co-articulation). Thus the more
practical technique is to build a diphone synthesizer where we record each phoneme
in the context of each other phoneme.

However speech is more varied than that. Although we can modify the selected
diphones to obtain the desired prosody, such modification is not as good as if it were
actually spoken by a human. Thus the area of general unit selection synthesis has grown,
where the databases we select from have many more examples of speech, in more
contexts, and not just one example of each phoneme-phoneme transition in the language.
Deciding the size and design of databases most suitable for unit selection is difficult
and we discuss this in the following chapter. The techniques required to find the most
appropriate unit, taking into account phonetic context, word position, phrase position
as well as prosodic context, are important, but finding the right balance of these
features is still something of an art. In Chapter 12 we present a number of techniques
and experiments to build such synthesizers.

Although unit selection synthesizers clearly offer the best quality synthesis, their
databases are a substantial piece of work. They must be properly labeled, and
although we include automatic alignment techniques there will always be mistakes,
so hand correction is certainly both desirable and worthwhile. But that takes
time and certain skills to do. When unit selection is bad due to bad labels,
inappropriate weighting of features or just simply not enough good examples in the
database to choose from, the quality can be severely worse than diphones, so the
work in tuning a unit selection synthesizer is as much avoiding the bad examples as
improving the good ones. The third chapter on waveform synthesizers offers a
very practical compromise between diphone (safe) synthesizers and unit selection
(exciting) synthesizers. In Chapter 5 on limited domain synthesis, we discuss how to
target your recorded database to a particular application and get the benefits of the
high quality of unit selection synthesis without the requirement of very carefully
labeled databases. In many cases this third option is the most practical.

Of course it is possible to combine approaches. This requires more care in the design
but sometimes offers the best of all techniques. Having a targeted limited domain
synthesizer which can cover most of the desired language will be good, but falling
back on a good unit selection synthesizer for unknown words may be a real possibility.

Key choices:

size, type, prosodic modification, number of occurrences

Key positions in the space:

uniphones, diphones
unit selection, limited domain vs open

Need diagram for space of synthesizers






Chapter 11. Diphone databases 



This chapter describes the processes involved in designing, listing, recording, and
using a diphone database for a language.

Diphone introduction 

The basic idea behind building diphone databases is to explicitly list all possible
phone-phone transitions in a language. This makes the incorrect but practical and
simplifying assumption that co-articulatory effects never go over more than two
phones. The exact definition of phone here is in general nontrivial, and what a
"standard" phone set should be is not uncontroversial; various allophonic variations,
such as light and dark /l/, may also be included. Unlike generalized unit selection,
where multiple occurrences of phones may exist with various distinguishing features,
in a diphone database only one occurrence of each diphone is recorded. This
makes selection much easier but also makes for a large laborious collection task.

In general, the number of diphones in a language is the square of the number of
phones. However, in natural human languages, there are phonotactic constraints:
some phone-phone pairs, even whole classes of phone-phone combinations, may
not occur at all. These gaps are common in the world's languages. The exact
definition of "never exist" is also problematic. Humans can often generate those
so-called non-existent diphones if they try, and one must always think about phone
pairs that cross over word boundaries as well, but even then, certain combinations
cannot exist; for example, /hh/ /ng/ in English is probably impossible (we would
probably insert a schwa). /ng/ may really only appear after the vowel in a syllable (in
coda position); however, in other languages it can appear in syllable-initial position.
/hh/ cannot appear at the end of a syllable, though sometimes it may be pronounced when
trying to add aspiration to open vowels.

Diphone synthesis, and more generally any concatenative synthesis method, makes
an absolutely fixed choice about which units exist, and in circumstances where
something else is required, a mapping is necessary. When humans are given a context
where an unusual phone is desired, for example in a foreign word, they will (often)
attempt to produce it even though it falls outside their basic phonetic vocabulary.
The articulatory system is flexible enough to produce (or attempt to produce)
unfamiliar phones, as we all share the same underlying physical structures.
Concatenative synthesizers, however, have a fixed inventory and cannot reasonably be
made to produce anything outside their pre-defined vocabulary. Formant and articulatory
synthesizers have the advantage here. This is a basic trade off: concatenative
synthesizers typically produce much more natural synthesis than formant synthesizers,
but at the cost of being only able to produce those phones defined within their
inventory.

Since we wish to build voices for arbitrary text-to-speech systems which may include
unusual phones, some mapping, typically at the lexical level, can be used to ensure
all the required diphones lie within the recorded inventory. The resulting voice will
therefore be limited, and unusual phones will lie outside its range. This in many cases
is acceptable, though if the voice is specifically to be used for pronouncing Scottish
place names it would be advisable to include the /X/ phone as in "loch".

In addition to the base phones, various allophonic variations may also be considered.
Flapping, as when the /t/ becomes a /dx/ in the word "butter", is an example of an
allophonic variation, a reduction which occurs naturally in American English, and
including flaps in the phone set makes the synthetic speech more natural. Stressed and
unstressed vowels in Spanish, consonant cluster /r/ versus lone /r/ in English,
inter-syllabic diphones versus intra-syllabic ones: variations like these are well worth
considering. Ideally, all such possible variations should be included in a diphone list,
but the more variations you include, the larger the diphone set will be; remember
the general rule that the number of diphones is nearly the square of the number of
phones. This affects recording time, labeling time and ultimately the database size.






Duplicating all the vowels (e.g. stressed/unstressed versions) will significantly in- 
crease the database size. 
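The quadratic growth is easy to check with a little arithmetic. The Python sketch below uses the 43-phoneme figure quoted earlier for US English; the count of 15 vowels is purely a hypothetical number for illustration, and phonotactic gaps mean real inventories are somewhat smaller than these upper bounds.

```python
def max_diphones(num_phones):
    """Upper bound on the diphone count: every ordered phone pair."""
    return num_phones ** 2


def with_duplicated(num_phones, num_duplicated):
    """Upper bound after adding a second variant of some phones."""
    return max_diphones(num_phones + num_duplicated)


if __name__ == "__main__":
    print(max_diphones(43))          # 43 * 43 ordered pairs
    print(with_duplicated(43, 15))   # stressed/unstressed copies of 15 vowels
```

Under these assumptions, duplicating the vowels raises the bound from 1849 to 3364, nearly doubling the amount to record and label.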

These inventory questions are open, and depending on the resources you are willing 
or able to devote, can be extended considerably. It should be clear, however, that such 
a list is simply a basic set. Alternative synthesis methods and inventories of differ- 
ent unit sizes may produce better results for the amount of work (or data collected). 
Demi-syllable databases and mixed inventory methods such as Hadifix [portele96] 
may give better results under some conditions. Still, controlling the inventory and
using acoustic measures rather than linguistic knowledge to define the space of possible
units in your inventory has also been attempted, as in work like Whistler [huang97].
The most extreme view, where the unit inventory is not predefined at all but based
solely on what is available in general speech databases, is CHATR [campbell96].

Although generalized unit selection can produce much better synthesis than diphone
techniques, using more units makes selecting appropriate ones more difficult. In the
basic strategy presented in this section, selection of the appropriate unit from the
diphone inventory is trivial, while in a system like CHATR, selection of the
appropriate unit is a significantly difficult problem. (See Chapter 12 on unit selection
for more discussion of such techniques.) With a harder selection task, it is more likely
that mistakes will be made, which in unit selection can give some selections which
are much worse than diphones, even though other examples may be better.


Defining a diphone list 

Since diphones need to be clearly articulated, various techniques have been
proposed to elicit them from subjects. One technique is to use target words embedded
in carrier sentences to ensure that the diphones are pronounced with acceptable
duration and prosody (i.e. consistently). We have typically used nonsense words that
iterate through all possible combinations. The advantage of this is that you don't need
to search for natural examples that have the desired diphone, the list can be more
easily checked, and the presentation is less prone to pronunciation errors than if real
words were presented. The words look unnatural, but collecting all diphones is not a
particularly natural thing to do. See [isard86] or [stella83] for some more discussion
on the use of nonsense words for collecting diphones.



For best results, we believe the words should be pronounced with consistent vocal
effort, with as little prosodic variation as possible. In fact pronouncing them in a
monotone is ideal. Our nonsense words consist of a simple carrier form with the
diphones (where appropriate) being taken from a middle syllable. Except where schwa
and syllabic consonants are involved, that syllable should normally be a full stressed
one.

Some example code is given in src/diphone/darpaschema.scm. The basic idea is to
define classes of diphones, for example: vowel consonant, consonant vowel, vowel
vowel and consonant consonant. Then define carrier contexts for these and list the
cases. Here we use Festival's Scheme interpreter to generate the list, though any
scripting language is suitable. Our intention is that the diphone will come from a
middle syllable of the nonsense word, so that it is fully articulated and the
articulatory effects at the start and end of the word are minimized.

For example, to generate all vowel vowel diphones we define a carrier

(set! vv-carrier '((pau t aa t) (t aa pau)))
And we define a simple function that will enumerate all vowel vowel transitions 

(define (list-vvs)
  ;; Return a list of (diphone-name nonsense-word-phones) entries,
  ;; one for every ordered pair of vowels.
  (apply
   append
   (mapcar
    (lambda (v1)
      (mapcar
       (lambda (v2)
         (list
          (string-append v1 "-" v2)
          (append (car vv-carrier) (list v1 v2) (car (cdr vv-carrier)))))
       vowels))
    vowels)))

For those of you who aren't used to reading Lisp, this simply lists all possible
combinations, or in some potentially more readable format (in an imaginary language)

for v1 in vowels
    for v2 in vowels
        print pau t aa t $v1 $v2 t aa pau

The actual Lisp code returns a list of diphone names and phone strings. To be more
efficient, the DARPAbet example produces consonant-vowel and vowel-consonant
diphones in the same nonsense word, which reduces the number of words to be
spoken quite significantly. Your voice talent will appreciate this.
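Since any scripting language is suitable, the same enumeration can be sketched in Python. This is an illustrative equivalent, not part of the FestVox distribution: the function names are hypothetical, the vowel-vowel carrier mirrors vv-carrier above, and the consonant-vowel-consonant carrier is an assumption inferred from the sample list entries shown later in this chapter.

```python
# Carrier contexts as (prefix, suffix) phone lists.
VV_CARRIER = (["pau", "t", "aa", "t"], ["t", "aa", "pau"])
CVC_CARRIER = (["pau", "t", "aa"], ["aa", "pau"])  # assumed shape


def list_vvs(vowels):
    """One nonsense word per ordered vowel pair: (name, phones)."""
    prefix, suffix = VV_CARRIER
    return [(f"{v1}-{v2}", prefix + [v1, v2] + suffix)
            for v1 in vowels for v2 in vowels]


def list_cvcs(consonants, vowels):
    """One nonsense word per consonant/vowel pair, yielding two diphones.

    The word contains C V C, so both the consonant-vowel and the
    vowel-consonant diphone are collected from a single recording.
    """
    prefix, suffix = CVC_CARRIER
    entries = []
    for c in consonants:
        for v in vowels:
            names = [f"{c}-{v}", f"{v}-{c}"]
            entries.append((names, prefix + [c, v, c] + suffix))
    return entries
```

With n vowels `list_vvs` yields n*n words, while `list_cvcs` needs only one word for each consonant/vowel pair rather than two, which is the saving described above.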



Although the idea seems simple to begin with, simply listing all contexts and pairs,
there are other constraints. Some consonants can only appear in the onset of a syllable
(before the vowel), and others are restricted to the coda.

While one can collect all the diphones without considering where they fall in a
syllable, it often makes sense to collect diphones in different syllabic contexts.
Consonant clusters are the obvious next set to consider; thus the example DARPAbet
schema includes simple consonant clusters with explicit syllable boundaries. We also
include syllabic consonants, though these may be harder to pronounce in all contexts.
You can add other phenomena too, but this is at the cost of not only making the list
longer (and making it take longer to record), but also making it harder to produce. You
must consider how easy it is for your voice talent to pronounce them, and how consistent
they can be about it. For example, not all American speakers produce flaps (/dx/)
in all of the same contexts (and some may produce them even when you ask them
not to), and it is quite difficult for some to pronounce them, which can lead to
production/transcription mismatches.

A second and related problem is language interference, which can cause phoneme
crossover. Because of the prevalence of English, especially in electronic text, how
many "foreign" phones should be considered for addition? For example, should /w/
be included for German speakers (maybe), /t-i/ for Japanese (probably) or both /b/
and /v/ for Spanish speakers ("B de burro / V de vaca")? This problem is made
difficult by the fact that the people you are recording will often be fluent or nearly
fluent in English, and hence already have reasonable ability in phones that are not in
their native language. If you are unfamiliar with the phone set and constraints of a
language, it pays off considerably to either ask someone (like a linguist!) who knows
the language analytically (not just by intuition), to check the literature, or to do some
research.

To the degree that they are expected to appear, regardless of their status in the target
language per se, foreign phones should be considered for the inventory. Remember
that in most languages, nowadays, making no attempt to accommodate foreign
phones is considered ignorant at least and possibly even arrogant.

Ultimately, when more complex forms are needed, extending the "diphone" set
becomes prohibitive and has diminishing returns. Obviously there are phonetic
differences between onset and coda positions, co-articulatory effects which go over
more than one phone, stress differences, intonational accent differences, and
phrase-positional differences, to name but a few. Explicitly enumerating all of these,
or even deciding the relative importance of each, is a difficult research question, and
arguably shouldn't be done in an abstract, linguistically generated fashion from a
strict interpretation of the language. Identifying these potential differences and
finding an inventory which takes into account the actual distinctions a speaker makes
is far more productive and is the fundamental part of many new research directions in
concatenative speech synthesis. (See the discussion in the introduction above.)

However you choose to construct the diphone list, and whatever examples you
choose to include, the tools and scripts included with this document require that
it be in a particular format.

Each line should contain a file id, a prompt, and a diphone name (or list of names if
more than one diphone is being extracted from that file). The file id is used in the
filename for the waveform, label file, and any other parameter files associated with
the nonsense word. We usually make this distinct for the particular speaker we are
going to record, e.g. their initials and possibly the language they are speaking.

The prompt is presented to the speaker at recording time, and here it contains a string
of the phones in the nonsense word from which the diphones will be extracted. For
example the following is taken from the DARPAbet-generated list

( uk_0001 "pau t aa b aa b aa pau" ("b-aa" "aa-b") )
( uk_0002 "pau t aa p aa p aa pau" ("p-aa" "aa-p") )
( uk_0003 "pau t aa d aa d aa pau" ("d-aa" "aa-d") )
( uk_0004 "pau t aa t aa t aa pau" ("t-aa" "aa-t") )
( uk_0005 "pau t aa g aa g aa pau" ("g-aa" "aa-g") )
( uk_0006 "pau t aa k aa k aa pau" ("k-aa" "aa-k") )
...
( uk_0601 "pau t aa t ey aa t aa pau" ("ey-aa") )
( uk_0602 "pau t aa t ey ae t aa pau" ("ey-ae") )
( uk_0603 "pau t aa t ey ah t aa pau" ("ey-ah") )
( uk_0604 "pau t aa t ey ao t aa pau" ("ey-ao") )
...
( uk_0748 "pau t aa p - r aa t aa pau" ("p-r") )
( uk_0749 "pau t aa p - w aa t aa pau" ("p-w") )
( uk_0750 "pau t aa p - y aa t aa pau" ("p-y") )
( uk_0751 "pau t aa p - m aa t aa pau" ("p-m") )

Note that the explicit syllable boundary marker - in the consonant-consonant
diphones is used to distinguish them from the consonant cluster examples that appear
later.
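
As a rough illustration of the list format described above, the following Python
sketch reads entries into (file id, prompt phones, diphone names) tuples. It is not
part of the FestVox tools; the function name parse_diphone_list and the regular
expression are our own assumptions about how one might read such a file outside
Festival.

```python
import re

# Matches one entry such as:
#   ( uk_0001 "pau t aa b aa b aa pau" ("b-aa" "aa-b") )
ENTRY_RE = re.compile(r'\(\s*(\S+)\s+"([^"]*)"\s+\(([^)]*)\)\s*\)')


def parse_diphone_list(text):
    """Return a list of (file_id, prompt_phones, diphone_names) tuples."""
    entries = []
    for file_id, prompt, names in ENTRY_RE.findall(text):
        # The diphone names are the quoted strings inside the inner list.
        diphones = re.findall(r'"([^"]*)"', names)
        entries.append((file_id, prompt.split(), diphones))
    return entries


example = '''
( uk_0001 "pau t aa b aa b aa pau" ("b-aa" "aa-b") )
( uk_0748 "pau t aa p - r aa t aa pau" ("p-r") )
'''

entries = parse_diphone_list(example)
for file_id, phones, diphones in entries:
    print(file_id, phones, diphones)
```

The syllable boundary marker is kept as an ordinary token in the prompt, so a
consumer of this list can decide for itself whether to treat it as a phone.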

Synthesizing prompts

To help keep pronunciation consistent we suggest synthesizing prompts and playing
them to your voice talent at collection time. This helps the speaker in two ways:
if they mimic the prompt they are more likely to keep a fixed prosodic style, and it
also reduces the number of errors where the speaker vocalizes the wrong diphone. Of
course, for new languages where a set of diphones doesn't already exist, producing
prompts is not easy; however, giving approximations with diphone sets from other
languages may work. The problem then is that, in producing prompts from a different
phone set, the speaker is likely to mimic the prompts, hence the diphone set will
probably seem to have a foreign pronunciation, especially for vowels. Furthermore,
mimicking the synthesizer too closely can remove some of the speaker's natural voice
quality, which is under their (possibly subconscious) control to some degree.



Even when synthesizing prompts from an existing diphone set, you must be aware 
that that diphone set may contain errors or that certain examples will not be syn- 
thesized appropriately (e.g. consonant clusters). Because of this, it is still worthwhile 
monitoring the speaker to ensure they say things correctly. 

The basic code for generating the prompts is in src/diphone/diphlist.scm,
and a specific example for the DARPA phone set for American English is in
src/diphone/us_schema.scm. The prompts can be generated from the diphone list
as described above (or at the same time). The example code produces the prompts
and phone label files which can be used by the aligning tool described below.

Before synthesizing, the function Diphone_Prompt_Setup is called, if it has been
defined. You should define this to set up the appropriate voices in Festival, as well
as any other initialization you might need, for example setting the fundamental
frequency (F0) for the prompts that are to be delivered in a monotone (disregarding
so-called microprosody, which is another matter). This value is set through the
variable FP_F0 and should be near the middle of the range for the speaker, or at least
somewhere comfortable to deliver. For the DARPAbet diphone list for KAL, we have:

(define (Diphone_Prompt_Setup)
  "(Diphone_Prompt_Setup)
Called before synthesizing the prompt waveforms.  Uses the kal_diphone
voice for prompting and sets F0."
  (voice_kal_diphone)  ;; US male voice
  (set! FP_F0 90)      ;; lower F0 than ked
  )

If the function Diphone_Prompt_Word is defined, it will be called after the basic
prompt-word utterance has been created, and before the actual waveform synthesis.
This may be used to map phones to other phones, set durations or whatever you feel
appropriate for your speaker/diphone set. For the KAL set, we redefined the syllabic
consonants to their full consonant forms in the prompts, since the ked diphone
database doesn't actually include syllabics. Also, in the example below, instead of
using fixed (100ms) durations we make the diphones use a constant scaling factor
(here, 1.2) times the average durations of the phones.

(define (Diphone_Prompt_Word utt)
  "(Diphone_Prompt_Word utt)
Specify specific modifications of the utterance before synthesis
specific to this particular phone set."
  ;; No syllabics in kal so flip them to non-syllabic form
  (mapcar
   (lambda (s)
     (let ((n (item.name s)))
       (cond
        ((string-equal n "el")
         (item.set_name s "l"))
        ((string-equal n "em")
         (item.set_name s "m"))
        ((string-equal n "en")
         (item.set_name s "n")))))
   (utt.relation.items utt 'Segment))
  (set! phoneme_durations kd_durs)
  (Parameter.set 'Duration_Stretch '1.2)
  (Duration_Averages utt))

By convention, the prompt waveforms are saved in prompt-wav/ and their labels in
prompt-lab/. The prompts may be generated after the diphone list is given using
the following command:

festival festvox/us_schema.scm festvox/diphlist.scm
festival> (diphone-gen-schema "us" "etc/usdiph.list")



If you already have a diphone list schema generated in the file etc/usdiph.list, you
can do the following






festival festvox/us_schema.scm festvox/diphlist.scm

festival> (diphone-gen-waves "prompt-wav" "prompt-lab" "etc/usdiph.list") 



Another useful example of the setup functions is to generate prompts for a language 
for which no synthesizer exists yet — to "bootstrap" from one language to another. A 
simple mapping can be given between the target phoneset and an existing synthe- 
sizer's phone set. We don't know if this will be sufficient to actually use as prompts, 
but we have found that it is suitable to use these prompts for automatic alignment. 

The example here uses the voice_kal_diphone speaker, a US English
speaker, to produce prompts for a Japanese phone set; this code is in
src/diphones/ja_schema.scm



The function Diphone_Prompt_Setup calls the kal (US) voice, sets a suitable F0 value,
and sets the option diph_do_db_boundaries to nil. This option allows the diphone
boundaries to be copied into the prompt label files, but this doesn't work when
cross-language prompting is done, as the actual phones don't match the desired ones.

(define (Diphone_Prompt_Setup)
  "(Diphone_Prompt_Setup)
Called before synthesizing the prompt waveforms.  Cross language prompts
from US male (for gaijin male)."
  (voice_kal_diphone)  ;; US male voice
  (set! FP_F0 90)
  (set! diph_do_db_boundaries nil) ;; cross-lang confuses this
  )

At synthesis time, each Japanese phone must be mapped to an equivalent (one or
more) US phone. This is done through a simple table, set in nhg2radio_map, which
gives the closest phone or phones for the Japanese phone (those unlisted remain the
same).

Our mapping table looks like this

(set! nhg2radio_map
      '((a aa)
        (i iy)
        (o ow)
        (u uw)
        (e eh)
        (ts t s)
        (N n)
        (h hh)
        (Qk k)
        (Qg g)
        (Qd d)
        (Qt t)
        (Qts t s)
        (Qch t ch)
        (Qj jh)
        (j jh)
        (Qs s)
        (Qsh sh)
        (Qz z)
        (Qp p)
        (Qb b)
        (Qky k y)
        (Qshy sh y)
        (Qchy ch y)
        (Qpy p y)
        (ky k y)
        (gy g y)
        (jy jh y)
        (chy ch y)
        (shy sh y)
        (hy hh y)
        (py p y)
        (by b y)
        (my m y)
        (ny n y)
        (ry r y)))

Phones that are not explicitly mentioned map to themselves (e.g. most of the
consonants).

Finally we define Diphone_Prompt_Word to actually do the mapping. Where the
mapping involves more than one US phone we add an extra segment to the Segment
relation (defined in the Festival manual) and split the duration equally between
them. The basic function looks like

(define (Diphone_Prompt_Word utt)
  "(Diphone_Prompt_Word utt)
Specify specific modifications of the utterance before synthesis
specific to this particular phone set."
  (mapcar
   (lambda (s)
     (let ((n (item.name s))
           (newn (cdr (assoc_string (item.name s) nhg2radio_map))))
       (cond
        ((cdr newn)  ;; it's a dual one
         (let ((newi (item.insert s (list (car (cdr newn))) 'after)))
           (item.set_feat newi "end" (item.feat s "end"))
           (item.set_feat s "end"
                          (/ (+ (item.feat s "segment_start")
                                (item.feat s "end"))
                             2))
           (item.set_name s (car newn))))
        (newn
         (item.set_name s (car newn)))
        (t
         ;; as is
         ))))
   (utt.relation.items utt 'Segment))
  utt)

The label file produced from this will have the original desired language phones,
while the acoustic waveform will actually consist of phones in the target language.
Although this may seem like cheating, we have found this to work for Korean and
Japanese from English, and it is likely to work over many other language combination
pairs. For autolabeling, as the nonsense word phone names are pre-defined, alignment
just needs to be the best matching path, and as long as the phones are distinctive from
the ones around them this alignment method is likely to work.
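
To make the mapping step concrete outside Festival, here is a small Python sketch
of the same idea: each segment is renamed through the table, and a phone that maps
to more than one phone has its duration split equally. The names NHG2RADIO and
map_segments, and the (name, start, end) tuple layout, are assumptions made for
this illustration only; the table holds just a few of the entries listed above.

```python
# A few entries copied from the mapping table; unlisted phones map to themselves.
NHG2RADIO = {
    "a": ["aa"],
    "i": ["iy"],
    "ts": ["t", "s"],
    "ky": ["k", "y"],
}


def map_segments(segments, table):
    """Rename segments through table, splitting durations equally.

    segments is a list of (name, start, end) tuples with times in seconds.
    """
    out = []
    for name, start, end in segments:
        targets = table.get(name, [name])
        step = (end - start) / len(targets)
        for k, target in enumerate(targets):
            out.append((target, start + k * step, start + (k + 1) * step))
    return out


mapped = map_segments(
    [("ts", 0.0, 0.5), ("a", 0.5, 1.0), ("k", 1.0, 1.25)], NHG2RADIO)
for segment in mapped:
    print(segment)
```

The label file written for alignment would still carry the original names (ts, a,
k); only the phones handed to the synthesizer are replaced.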




Recording the diphones

The object of recording diphones is to get as uniform a set of pronunciations as
possible. Your speaker should be relaxed, and not be suffering from a cold, a cough or
a hangover. If something goes wrong with the recording and some of the examples need
to be re-recorded, it is important that the speaker has as similar a voice as with the
original recording; waiting for another cold to come along is not reasonable (though
some may argue that the same hangover can easily be induced). Also, to try to keep
the voice potentially repeatable it is wise to record at the same time of day; morning
is a good idea. The points on speaker selection and recording in the previous section
should also be borne in mind.




The recording environment should be reconstructable, so that the conditions can be
set up again if needed. Everything should be as well-defined as possible, as far as gain
settings, microphone distances, and so on. Anechoic chambers are best, but general
recording studios will do. We've even done recording in an open room; with care this
works (make sure there's little background noise from computers, air conditioning,
outside traffic etc). Of course open rooms aren't ideal, but they are better than open
noisy rooms.

The distance between the speaker and the microphone is crucial. A head mounted
mike helps keep this constant; the Shure SM-2 headset, for instance, works well with
the mic positioned at 8mm from the lips or so. This can be checked with a ruler.
Considering the cost and availability of headmounted microphones and rulers, you
should really consider using them. While even fixed microphones like the Shure SM-57
can be used well by professional voice talent, we strongly recommend a good
headset mic.

Ultimately, you need to split the recordings into individual files, one for each prompt.
Ideally this can be done while recording on a file-by-file basis, but that may not
be practical, and some other technique can be used, such as recording onto DAT and
transferring the data to disk (and downsampling) later. Files might contain 50-100
nonsense words each. In this case we hand label the words, taking into account any
duplicates caused by errors in the recording. The program ch_wave in the Edinburgh
Speech Tools (EST) offers a function to split a large file into individual files based on a
label file. We can use this to get our individual files. You may also add an identifiable
noise during recording and automatically detect that as a split point, as is often done
at the Oregon Graduate Institute. They typically use two different noises that can
easily be distinguished and use one for "OK" and one for "BAD"; this can make the
splitting of the files into the individual nonsense words easier. Note you will also need
to split the electroglottograph (EGG) signal exactly the same way, if you are using one.
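
For readers who want to see what label-based splitting amounts to, the following
Python sketch cuts one long recording into per-prompt files using only the standard
wave module. It is a stand-in for illustration, not the ch_wave implementation; the
function name split_wave and the (file id, start, end) segment tuples are our own
assumptions, and reading the times from an actual label file is left out.

```python
import os
import wave


def split_wave(src_path, segments, out_dir="."):
    """Write one wave file per segment and return the list of paths.

    segments is a list of (file_id, start_sec, end_sec) tuples, typically
    taken from a hand-made label file for the long recording.
    """
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for file_id, start, end in segments:
            first = int(round(start * rate))
            count = int(round(end * rate)) - first
            src.setpos(first)               # seek to the segment start
            frames = src.readframes(count)  # read exactly this segment
            path = os.path.join(out_dir, file_id + ".wav")
            with wave.open(path, "wb") as dst:
                dst.setparams(params)       # same rate, width and channels
                dst.writeframes(frames)     # header frame count is fixed up
            paths.append(path)
    return paths
```

The same segment list can be applied to the EGG recording so that the voice and
EGG files stay split at identical points.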

No matter how you split these, you should be aware that there will still often be
mistakes, and checking by listening will help.

We now almost always record directly to disk on a computer using a sound card;
see the Section called Recording under Unix in Chapter 4 for recording setup details.
There can be a reduction in the quality of the recording due to poor quality audio
hardware in computers (and often too much noise), though at least the hardware
issue is getting to be less of a problem these days. There are lots of advantages to
recording directly to disk, as the stage of digitising, transferring and splitting the offline

Labeling the diphones

Labeling nonsense words is much easier than labeling continuous speech, whether
it is by hand or automatically. With nonsense words, it is completely defined which
phones are there and they are (hopefully) clearly articulated.

We have had significant experience in hand labeling diphones, and with the right
tools it can be done fairly quickly (e.g. 20 hours for 2500 nonsense words) even if it
is a mind-numbing exercise which your voice talent may offer you little sympathy
for after you've made them babble for hours in a box with electrodes on their throat
(optional). But labeling can't realistically be done for more than an hour or two at
any one time. As a minimum, the start of the preceding phone to the first phone
in the diphone, the changeover, and the end of the second phone in the diphone
should be labeled. Note we recommend phone boundary labeling as that is much
better defined than phone middle marking. The diphone will, by default, be extracted
from the middle of phone one to the middle of phone two.
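
The default extraction rule is simple arithmetic on the three labeled points. The
Python sketch below shows it under stated assumptions: the function name
diphone_span is hypothetical, times are in seconds, and no explicit diphone
boundary labels are present.

```python
def diphone_span(phone1_start, boundary, phone2_end):
    """Return (start, mid, end) for a diphone from three labeled times.

    The diphone runs from the middle of phone one to the middle of phone
    two, with the phone boundary (the changeover) as its mid-point.
    """
    start = (phone1_start + boundary) / 2.0
    end = (boundary + phone2_end) / 2.0
    return start, boundary, end


# Phone one spans 0.25-0.5 s and phone two spans 0.5-1.0 s.
print(diphone_span(0.25, 0.5, 1.0))
```

With the values shown, the diphone starts at 0.375 s, changes over at 0.5 s and
ends at 0.75 s.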

Your data set conventions may include the labeling of closures within stops explicitly.
Thus you would expect the label tcl at the end of the silence part of a /t/ and a
label t after the burst. This way the diphone boundary can automatically be placed
within the silence part of the stop. The label db can be used when explicit diphone
boundaries are desirable; this is useful within phones such as diphthongs where the
temporal middle need not be the most stable part.

Another place where specific diphone boundaries are recommended is in the
phone-to-silence diphones. The phones at the end of words are typically longer than
word internal phones, and tend to trail off in energy. Thus the midpoint of a phone
immediately before a silence typically has much less energy than the midpoint of a
word internal phone. Thus, when a diphone is to be concatenated to a phone-silence
diphone, there would be a big jump in energy (as well as other related spectral
characteristics). Our solution to this is to explicitly label a diphone boundary near the
beginning of the phone before the silence (about 20% in) where the energy is much
closer to what it will be in the diphone that will precede it.

If you are using explicit closures, it is worth noting that stops at the start of words
don't seem to have a closure part; however it is a good idea to actually label one
anyway, if you are doing this by hand. Just "steal" a suitable short piece of silence
from the preceding part of the waveform.



Because the words will often have very varying amounts of silence around them, it
is a good idea to label multiple silences around the word, so that the silence
immediately before the first phone is about 200-300 ms, labeling the silence before that
as another phone; likewise with the final silence. Also, as the final phone before the end
silence may trail off, we recommend that the end of the last phone come at the very
end of any signal, thus appearing to include silence within it. Then label the real silence
(200-300 ms) after it. The reason for this is that if the end silence happens to include
some part of the spoken signal, and if this is duplicated, as is the case when duration is
elongated, an audible buzz can be introduced.

Because labeling of diphone nonsense words is such a constrained task we have
included a program for automatically providing a labeling for the spoken prompts.
This requires that prompts be generated for the diphone database. The aligner uses
those prompts to do the aligning. Though it's not actually necessary that the prompts
were used as prompts, they do need to be generated for this alignment process. This is
not the only means for alignment; you may also, for instance, use a speech recognizer,
such as CMU Sphinx, to segment (align) the data.

The idea behind the aligner is to take the prompt and the spoken form and derive
mel-scale cepstral parameterizations (and their deltas) of the files. Then a DTW
(dynamic time warping) algorithm is used to find the best alignment between these two
sets of features. Then the prompt label file is used to index through the alignment
to give a label file for the spoken nonsense word. This is largely based on the
techniques described in [malfrere97], though this general technique has been used for
many years.
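
The alignment idea can be sketched in a few lines of Python. This is a generic
textbook DTW, not the aligner shipped with FestVox: the Euclidean frame distance,
the three-way step pattern, and the names dtw_path and transfer_boundary are all
assumptions for illustration, and the toy one-dimensional features stand in for
mel-cepstral vectors.

```python
import math


def dtw_path(a, b):
    """Align feature sequences a and b; return the path as (i, j) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Trace back from the final cell to recover the cheapest path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path


def transfer_boundary(path, prompt_frame):
    """Map a label at prompt_frame to the first spoken frame aligned to it."""
    for i, j in path:
        if i >= prompt_frame:
            return j
    return path[-1][1]


prompt = [[0.0], [0.0], [1.0], [1.0]]   # phone change at frame 2
spoken = [[0.0], [1.0], [1.0], [1.0]]   # phone change at frame 1
path = dtw_path(prompt, spoken)
print(path, transfer_boundary(path, 2))
```

Each boundary in the prompt label file would be transferred this way, then
converted from frames back to seconds using the frame shift of the analysis.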

We have tested this aligner on a number of existing hand-labeled databases to
compare the quality of the alignments with respect to the hand labeling. We have
also tested aligning prompts generated from a language different from that being
recorded. To do this there needs to be a reasonable mapping between the language
phonesets.

Here are results for automatically finding labels for the ked (US English) by aligning
them against prompts generated by three different voices

ked itself

mean error 14.77ms stddev 17.08

mwm (US English)

mean error 27.23ms stddev 28.95

gsw (UK English)

mean error 25.25ms stddev 23.923






Note that gsw actually gives better results than mwm, even though it is a different
dialect of English. We built three diphone index files, one from each of the label sets
generated from these alignment processes. ked-to-ked was the best, and only marginally
worse than the database made from the manually produced labels. The databases from
mwm and gsw produced labels were a little worse but not unacceptably so. Considering
that a significant amount of careful correction was made to the manually produced
labels, these automatically produced labels are still significantly better than the first
pass of hand labels.

A further experiment was made across languages; the ked diphones were used as
prompts to align a set of Korean diphones. Even though there are a number of phones
in Korean not present in English (various forms of aspirated consonants), the results
are quite usable.

Whether you use hand labeling or automatic alignment, it is always worthwhile
doing some hand-correction after the basic database is built. Mistakes (sometimes
systematic) always occur, and listening to a substantial subset of the diphones (or
them all if you resynthesize the nonsense words) is definitely worth the time in finding
bad diphones. The devil is in the details.

The script festvox/src/diphones/make_labs will process a set of prompts and
their spoken (recorded) form, generating a set of label files, to the best of its ability.
The script expects the following to already exist

prompt-wav/

The waveforms as synthesized by Festival.

prompt-lab/

The label files corresponding to the synthesized prompts in prompt-wav/.

prompt-cep/

The directory where the cepstral feature streams for each prompt will be saved.

wav/

The directory holding the nonsense words spoken by your voice talent. They
should have the same file id as the waveforms in prompt-wav/.

cep/

The directory where the cepstral feature streams for the recorded material will
be saved.

lab/

The directory where the generated label files for the spoken words in wav/ will be
saved.

To run the script over the prompt waveforms

bin/make_labs prompt-wav/*.wav

The script is written so it may be used in parallel on multiple machines if you
want to distribute the process. On a Pentium Pro 200MHz, which you probably won't
be able to find any more, a 2000 word diphone database can be labeled in about 30
minutes. Most of that time is in generating the cepstrum coefficients. This is down to
a few minutes at most on a dual Pentium III 550.

Once the nonsense words have been labeled, you need to build a diphone index.
The index identifies which diphone comes from which files, and from where.
This can be automatically built from the label files (mostly). The Festival script
festvox/src/diphones/make_diph_index will take the diphone list (as used
above), find the occurrence of each diphone in the label files, and build an index.




The index consists of a simple header, followed by a single line for each diphone: the 
diphone name, the fileid, start time, mid-point (i.e. the phone boundary) and end 
time. The times are given in seconds (note that early versions of Festival, using a 
different diphone synthesizer module, used milliseconds for this. If you have such 
an old version of Festival, it's time to update it). 

An example from the start of a diphone index file is 



EST_File index
DataType ascii
NumEntries 1610
IndexName ked2_diphone
EST_Header_End
y-aw kd1_002 0.435 0.500 0.560
y-ao kd1_003 0.400 0.450 0.510
y-uw kd1_004 0.345 0.400 0.435
y-aa kd1_005 0.255 0.310 0.365
y-ey kd1_006 0.245 0.310 0.370
y-ay kd1_008 0.250 0.320 0.380
y-oy kd1_009 0.260 0.310 0.370
y-ow kd1_010 0.245 0.300 0.345
y-uh kd1_011 0.240 0.300 0.330
y-ih kd1_012 0.240 0.290 0.320
y-eh kd1_013 0.245 0.310 0.370
y-ah kd1_014 0.305 0.350 0.395

Note that the number of entries field must be correct; if it is too small, Festival will
(often confusingly) ignore the entries after that point.

This file can be created from a diphone list file and the label files in lab/ by the command

$FESTVOXDIR/src/diphones/make_diph_index etc/usdiph.list dic/kaldiph.est

You should check that this has successfully found all the named diphones. When
a diphone is not found in a label file, an entry with zeroes for the start, middle, and
end is generated, which will produce a warning when being used in Festival, but it
is worth checking in advance.

The make_diph_index program will take the midpoint between phone boundaries
for the diphone boundary, unless otherwise specified with the label db. It will also
automatically remove underscores and dollar symbols from the diphone names
before searching for the diphone in the label file, and it will only find the first occurrence
of the diphone.
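
Checking an index in advance is easy to script. The Python sketch below reads the
text of an index file, compares the declared number of entries with the number of
lines actually present, and lists diphones whose times are all zero. The function
name check_diphone_index is hypothetical and the sample index is made up for the
illustration; only the header keywords come from the example above.

```python
def check_diphone_index(text):
    """Return (declared, found, missing) for the text of a diphone index."""
    lines = text.splitlines()
    declared = None
    body_start = 0
    for k, line in enumerate(lines):
        fields = line.split()
        if fields and fields[0] == "NumEntries":
            declared = int(fields[1])
        if fields and fields[0] == "EST_Header_End":
            body_start = k + 1
            break
    entries = [l.split() for l in lines[body_start:] if l.strip()]
    # A diphone that was not found gets zeroes for start, middle and end.
    missing = [e[0] for e in entries
               if all(float(x) == 0.0 for x in e[2:5])]
    return declared, len(entries), missing


sample = """EST_File index
DataType ascii
NumEntries 2
IndexName demo_diphone
EST_Header_End
y-aw demo_002 0.435 0.500 0.560
y-ao demo_003 0.400 0.450 0.510
y-oy demo_009 0.000 0.000 0.000
"""

declared, found, missing = check_diphone_index(sample)
print(declared, found, missing)
```

Here the header declares fewer entries than are present, so the last line would be
ignored, and that last diphone was not found in the label files in any case.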

Extracting the pitchmarks 

Festival, in its publicly distributed form, currently only supports residual excited
Linear-Predictive Coding (LPC) resynthesis [hunt89]. It does support PSOLA
[moulines90], though this is not distributed in the public version. Both of these
techniques are pitch synchronous, that is, they require information about where pitch
periods occur in the acoustic signal. Where possible, it is better to record with an
electroglottograph (EGG, also known as a laryngograph) at the same time as the
voice signal. The EGG records electrical activity in the glottis during speech, which
makes it easier to get the pitch moments, and so they can be more precisely found.

Although extracting pitch periods from the EGG signal is not trivial, it is fairly 
straightforward in practice, as The Edinburgh Speech Tools include a program 
pitchmark which will process the EGG signal giving a set of pitchmarks. However 
it is not fully automatic and requires someone to look at the result and make some 
decisions to change parameters that may improve the result. 






The first major issue in processing the signal is deciding which way is up. From our 
experience, we have seen the signal inverted in some cases and it is necessary to iden- 
tify the direction in order for the rest of the processing to work properly. In general 
we've found the CSTR's LAR output is upside down while OGI's and CMU's output 
is the right way up, though this can even flip from file to file. If you find inverted 
signals, you should add -inv to the arguments to pitchmark. 

The object is to produce a single mark at the peak of each pitch period and "fake" 
or "phantom" periods during unvoiced regions. The basic command we have found 
that works for us is 

pitchmark lar/file001.lar -o pm/file001.pm -otype est \
   -min 0.005 -max 0.012 -fill -def 0.01 -wave_end

It is worth doing one or two by hand and confirming that reasonable pitch
periods are found. Note that the -min and -max arguments are speaker-dependent.
These can be moved towards the fixed F0 point used in the prompts, though
remember the speaker will not have been exactly constant. The script
festvox/src/general/make_pm can be copied and modified (for the particular
pitch range) and run to generate the pitchmarks

bin/make_pm lar/*.lar
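
When adapting the range to a speaker it can help to think in terms of F0 rather
than periods. The sketch below assumes, as the values 0.005 and 0.012 suggest,
that -min and -max are pitch periods in seconds, so each is the reciprocal of an F0
limit in Hz. The function name period_range is our own.

```python
def period_range(f0_low_hz, f0_high_hz):
    """Convert an F0 range in Hz to (min_period, max_period) in seconds."""
    if not 0 < f0_low_hz <= f0_high_hz:
        raise ValueError("expected 0 < f0_low_hz <= f0_high_hz")
    # The highest pitch gives the shortest period, and vice versa.
    return 1.0 / f0_high_hz, 1.0 / f0_low_hz


min_period, max_period = period_range(80.0, 200.0)
print("-min %.4f -max %.4f" % (min_period, max_period))
```

Under this reading, the command shown above allows pitch from roughly 83 Hz
(1/0.012) up to 200 Hz (1/0.005).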

If you don't have an EGG signal for your diphones, the alternative is to extract the
pitch periods using some other signal processing function. Finding the pitch periods
is similar to finding the F0 contour and, although harder than finding it from
the EGG signal, with clean laboratory-recorded speech, such as diphones, it is possible.
The following script is a modification of the make_pm script above for extracting
pitchmarks from a raw waveform signal. It is not as good as extracting from the EGG
file, but it works. It is more computationally intensive, as it requires rather high order
filters. The values should change depending on the speaker's pitch range.

for i in $*
do
   fname=`basename $i .wav`
   echo $i
   $ESTDIR/bin/ch_wave -scaleN 0.9 $i -F 16000 -o /tmp/tmp$$.wav
   $ESTDIR/bin/pitchmark /tmp/tmp$$.wav -o pm/$fname.pm \
      -otype est -min 0.005 -max 0.012 -fill -def 0.01 \
      -wave_end -lx_lf 200 -lx_lo 71 -lx_hf 80 -lx_ho 71 -med_o 0
done

If you are extracting pitch periods automatically, it is worth taking more care to
check the signal. We have found that inconsistent recording and bad pitch extraction
are the two most common causes of poor quality synthesis.

See the Section called Extracting pitchmarks from waveforms in Chapter 4 for a more
detailed discussion on how to do this.

Building LPC parameters

Currently the only publicly distributed signal processing method in Festival is
residual excited LPC. To use this, you must extract LPC parameters and LPC residual
files for each file in the diphone database. Ideally, the LPC analysis should be done
pitch-synchronously, thus requiring that pitch marks are created before the LPC
analysis takes place.

A script suitable for generating the LPC coefficients and residuals is given in
festvox/src/general/make_lpc and is repeated here.




for i in $*
do
   fname=`basename $i .wav`
   echo $i

   # Potentially normalise the power
   #$ESTDIR/bin/ch_wave -scaleN 0.5 $i -o /tmp/tmp$$.wav
   # resampling can be done now too
   #$ESTDIR/bin/ch_wave -F 11025 $i -o /tmp/tmp$$.wav
   # Or use as is
   cp -p $i /tmp/tmp$$.wav
   $ESTDIR/bin/sig2fv /tmp/tmp$$.wav -o lpc/$fname.lpc \
      -otype est -lpc_order 16 -coefs "lpc" \
      -pm pm/$fname.pm -preemph 0.95 -factor 3 \
      -window_type hamming
   $ESTDIR/bin/sigfilter /tmp/tmp$$.wav -o lpc/$fname.res \
      -otype nist -lpc_filter lpc/$fname.lpc -inv_filter
   rm /tmp/tmp$$.wav
done



Note the (optional) use of ch_wave to attempt to normalize the power in the wave to
a percentage of its maximum. This is a very crude method for making the waveforms
have reasonably equivalent power. Wildly different power between segments is
likely to be noticed when they are joined. Differing power in the
nonsense words may occur if not enough care has been taken in the recording. Either
the settings on the recording equipment have been changed (bad) or the speaker has
changed their vocal effort (worse). It is important that this be avoided, as the
above normalization does not make the problem of different power go away; it only
makes the problem slightly less bad.

A more elaborate power normalization has been successful, but it is a little harder,
though it was definitely successful for the KED US American voice that had major
power fluctuations over different recording sessions. The idea is to find the power
during vowels in each nonsense word, then find the mean power for each vowel
over all files. Then, for each file, find the average factor difference for each actual
vowel with the mean for that vowel and scale the waveform according to that value.
We now provide a basic script which does this

bin/find_powerfacts lab/*.lab 

This script creates (among others) etc/powfacts which, if it exists, is used to
normalize the power of each waveform file during the making of the LPC coefficients.

We generate a set of ch_wave commands that extract the parts of the wave that
are vowels (using the -start and -end options), make the output be in ascii (-otype raw
-ostype ascii), and use a simple script to calculate the RMS power. We then calculate
the mean power for each vowel with another awk script, using the result as a
table. Finally we process the fileid and actual vowel power information to generate
a power factor by averaging the ratio of each vowel's actual power to the mean
power for that vowel. You may still wish to modify the power further after this if it
is too low or high.
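
The averaging step can be written down compactly. The following Python sketch
takes already-measured RMS powers and computes, for each file, the mean ratio of
each vowel's power to the overall mean for that vowel. It is an illustration of the
description above rather than the find_powerfacts script itself; the function name
power_factors and the input tuples are assumptions, and how the resulting factor
is then applied to the waveform is deliberately left out.

```python
from collections import defaultdict


def power_factors(vowel_powers):
    """Return {file_id: mean ratio of vowel power to that vowel's mean}.

    vowel_powers is a list of (file_id, vowel, rms_power) tuples.
    """
    by_vowel = defaultdict(list)
    for _, vowel, power in vowel_powers:
        by_vowel[vowel].append(power)
    mean_power = {v: sum(p) / len(p) for v, p in by_vowel.items()}

    ratios = defaultdict(list)
    for file_id, vowel, power in vowel_powers:
        ratios[file_id].append(power / mean_power[vowel])
    return {f: sum(r) / len(r) for f, r in ratios.items()}


measurements = [
    ("f1", "aa", 2.0), ("f2", "aa", 4.0),
    ("f1", "iy", 1.0), ("f2", "iy", 3.0),
]
factors = power_factors(measurements)
print(factors)
```

Because each vowel is compared only with its own mean, a file is not penalized
for containing intrinsically louder or quieter vowels.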

Note that power normalization is intended to remove artifacts caused by a different
recording environment, i.e. the person moved from the microphone, the levels were
changed, etc. It should not modify the intrinsic power differences in the phones
themselves. The above techniques try to preserve the intrinsic power, which is why
we take the average over all vowels in a nonsense word, though you should listen to
the results and make the ultimate decision yourself.

If all has been recorded properly, of course, individual power modification should be 
unnecessary. Once again, we can't stress enough how important it is to have good 
and consistent recording conditions, so as to avoid steps like this. 






If you want to generate a database using a different sampling rate than the recordings
were made with, this is the time to resample. For example an 8kHz or 11.025kHz
database will be smaller than a 16kHz database. If the eventual voice is to be played
over the telephone, for example, there is little point in generating anything but 8kHz.
Also it will be faster to synthesize 8kHz utterances than 16kHz ones.

The number of LPC coefficients used to represent each pitch period can be changed
depending on the sample rate you choose. Hearsay, reasonable experience, and perhaps
some theoretical underpinning, suggest the following formula for calculating the
order

(sample_rate/1000)+2

But that should only be taken as a rough guide, though a larger sample rate deserves
a greater number of coefficients.
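
As a worked example of the rule of thumb, the snippet below evaluates it for the
sample rates mentioned in this section. Rounding down to a whole number is our
assumption (the formula does not say how to treat rates such as 11025), and the
function name lpc_order is hypothetical.

```python
def lpc_order(sample_rate_hz):
    """Rule-of-thumb LPC order: (sample_rate / 1000) + 2, rounded down."""
    return int(sample_rate_hz) // 1000 + 2


for rate in (8000, 11025, 16000):
    print(rate, lpc_order(rate))
```

This gives orders of 10, 13 and 18 respectively, which is in the same region as the
order of 16 used in the script above.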



Defining a diphone voice

The easiest way to define a voice is to start from the skeleton scheme files distributed.
For English voices see Chapter 20, and for non-English voices see Chapter 19 for
detailed walkthroughs.

Although in many cases you'll want to modify these files (sometimes quite
substantially), the basic skeleton files will give you a good grounding, and they follow
some basic conventions of voice files that will make it easier to integrate your voice into
the Festival system.

Checking and correcting diphones

This probably sounds like we're repeating ourselves here, and we are, because it's
quite important for the overall quality of the voice: once you have the basic
diphone database working it is worthwhile systematically testing it, as it is common
to have mistakes. These may be mislabelings or mispronunciations of the phones
themselves. Two strategies are possible for testing, both of which have their
advantages. The first is a simple exhaustive synthesis of all diphones. Ideally, the
diphone prompts are exactly the set of utterances that test each and every diphone.
Using the SayPhones function you can synthesize and listen to each prompt. Actually,
for a first pass, it may even be useful to synthesize each nonsense word without
listening, as some of the problems (missing files, missing diphones, badly extracted
pitchmarks) will show up without you having to listen at all.

When a problem occurs, trace back why, check the entr^frvthe diphone index, then 
check the label for the nonsense word, then check how thaMfabel matches the actually 
waveform file itself (display the waveform with the label fijtfajnd spectrogram to see 
if the label is correct). 

Listing all the problems that could occur is impossible. What you need to do is break
down the problem and find out where it might be occurring. If you just get apparent
garbage being synthesized, take a look at the synthesized waveform

(set! utt1 (SayPhones '(pau hh ah l ow pau)))
(utt.save.wave utt1 "hello.wav")

Is it garbage, or can you recognize any part of it? It could be a byte swap problem or a
format problem for your files. Can your nonsense word file be played and displayed
as is? Can your LPC residual files be played and displayed? Residual files should
look like very low powered waveform files and sound very buzzy when played, but
almost recognizable if you know what is being said (sort of like Kenny from South
Park).




Chapter 11. Diphone databases 



If you can recognize some of what is being said but it is fairly uniformly garbled, it is
possible your pitchmarks are not being aligned properly. Use some display mechanism
to see where the pitchmarks are. These should be aligned (during voiced
speech) with the peaks in the signal.

If all is well except that some parts of the signal are bad or overflowed, then check the
diphone where the errors occur.

There are a number of solutions to problems that may save you some time. For the
most part they should be considered cheating, but they may save having to re-record,
which is something that you will probably want to avoid if at all possible.

Note that some phones are very similar; in particular the left halves of most stops are
indistinguishable, as they consist mostly of silence. Thus if you find you didn't get
a good SOMETHING-p diphone you can easily make it use the SOMETHING-b diphone
instead. You can do this by hand editing the diphone index file accordingly.

The linguists among you may not find that acceptable, but you can go further: the
burst part of /p/ and /b/ isn't that different when it comes down to it, and if it is just
one or two diphones you can simply map those too. Considering that problems are often
in one or two badly articulated phones, replacing a /p/ with a /b/ (or similar) in one
or two diphones may not be that bad.

Once, however, the problems become systematic over a number of phones,
re-recording them should be considered. Though remember, if you do have to
re-record you want to have as similar an environment as possible, which is not
always easy. Eventually you may need to re-record the whole database again.

Recording diphone databases is not an exact science. Although we have a fair amount
of experience in recording these databases, they never completely go as planned.
Some apparently minor problem often occurs: noise on the channel, or slightly different
power over two sessions. Even when everything seems the same and we can't identify
any difference between two recording environments, we have found that some
voices are better than others for building diphone databases. We can't immediately
say why. We discussed some of these issues above in selecting a speaker, but there are
still some other parameters which we can't identify, so don't be disheartened when
your database isn't as good as you hoped; ours sometimes fail too.

Diphone check list

This section contains a quick check list of the processes required to construct a
working diphone database. Each part is discussed in detail above.



Choose phoneset: Find an appropriate phoneset for the language, if possible using
an existing standard. If you already have a good lexicon in the desired language,
we recommend that you use that phone set.

Construct diphone list: Construct the diphone list with appropriate carrier words,
either using an existing list or generating one from the examples. Consider what
allophones, consonant clusters, etc., you also wish to record.

Synthesize prompts: Synthesize prompts from an existing voice, if possible. Even
when a few phones are missing from that voice it can still be useful to have the
speaker listen to prompts, as it keeps them focussed on minimal prosody and normalized
vocal effort as well as reminding them what they need to say.

Record words: Record the words in the best possible conditions you can. Bad
recordings can never be corrected later. Ideally, you would use an anechoic chamber
with voice from a close-talking mike and laryngograph channels.

Hand label/align phones: If you used prompts you can probably use the provided 
aligner to get a reasonable first pass at the phone labels. Alternatively, find a dif- 
ferent aligner, or do it by hand. 




Extract pitchmarks: Extract the pitchmarks from the recorded signal, either from 
the EGG signal, or by the more complicated approach of extracting them from the 
speech signal itself. 

Build parameter files: If you don't have PSOLA, extract the LPC parameters and
residuals from the speech signal, with power normalization if you feel it's necessary.

Build database itself: Build the diphone index, correcting any obvious labeling errors,
then test the database itself, running significant tests to correct any further
labeling errors.

Test and check database: Systematically check the database by synthesizing the
prompts again and synthesizing general text.






Chapter 12. Unit selection databases 



This chapter discusses some of the options for building waveform synthesizers using
unit selection techniques in Festival. This is still very much an on-going research
question and we are still adding new techniques, as well as improving existing ones,
often, so the techniques described here are not as mature as the techniques
described in the previous diphone chapter.

By "unit selection" we actually mean the selection of some unit of speech which may
be anything from a whole phrase down to a diphone (or even smaller). Technically
diphone selection is a simple case of this. However, typically what we mean is that, unlike
diphone selection, in unit selection there is more than one example of the unit and
some mechanism is used to select between them at run-time.



ATR's CHATR [hunt96] system and earlier work at that lab [nuutalk92] is an excellent
example of one particular method for selecting between multiple examples
of a phone within a database. For a discussion of why a more generalized inventory
of units is desired see [campbell96], though we will reiterate some of the points
here. With diphones a fixed view of the possible space of speech units has been made,
which we all know is not ideal. There are articulatory effects which go over more than
one phone, e.g. /s/ can take on artifacts of the roundness of the following vowel even
over an intermediate stop, e.g. "spout" vs "spit". But it's not just obvious segmental
effects that cause variation in pronunciation: syllable position and word/phrase initial
and final position typically have a different level of articulation from segments taken
from word internal position. Stressing and accents also cause differences. Rather than
try to explicitly list the desired inventory of all these phenomena and then have to
record all of them, a potential alternative is to take a natural distribution of speech
and (semi-)automatically find the distinctions that actually exist rather than predefining
them.

The theory is obvious, but the design of such systems, finding the appropriate
selection criteria, and weighting the costs of relative candidates is a non-trivial problem.
However, techniques like this often produce very high quality, very natural sounding
synthesis. They can also produce some very bad synthesis, when the
database has unexpected holes and/or the selection costs fail.



Two forms of unit selection will be discussed here, not because we feel they are the
best but simply because they are the ones actually implemented by us and hence
can be distributed. These should still be considered research systems. Unless you are
specifically interested or have the expertise in developing new selection techniques it
is not recommended that you try these; if you need a working voice within a month
and can't afford to miss that deadline then the diphone option is safe, well tried
and stable. If you need higher quality and know something about what you need to
say, then we recommend the limited domain techniques discussed in the following
chapter. Limited domain synthesis offers the high quality of unit selection but
avoids much (all?) of the bad selections.

Cluster unit selection 

This is a reimplementation of the techniques as described in [black97c]. The idea is
to take a database of general speech and try to cluster each phone type into groups
of acoustically similar units based on the (non-acoustic) information available at
synthesis time, such as phonetic context, prosodic features (F0 and duration) and higher
level features such as stressing, word position, and accents. The actual features
used may easily be changed and experimented with, as can the definition of the
acoustic distance between the units in a cluster.

In some sense this work builds on the results of both the CHATR selection algorithm
[hunt96] and the work of [donovan95], but differs in some important and significant
ways. Specifically, in contrast to [hunt96], this cluster algorithm pre-builds CART trees
to select the appropriate cluster of candidate phones, thus avoiding the computationally
expensive function of calculating target costs (through linear regression) at
selection time. Secondly, because the clusters are built directly from the acoustic scores
and target features, a target estimation function isn't required, removing the need to
calculate weights for each feature. This cluster method differs from the clustering
method in [donovan95] in that it can use more generalized features in clustering and
uses a different acoustic cost function (Donovan uses HMMs); also his work is based
on sub-phonetic units (HMM states). Also Donovan selects one candidate, while here
we select a group of candidates and find the best overall selection by finding the
best path through each set of candidates for each target phone, in a manner similar
to [hunt96] and [iwahashi93] before.

The basic processes involved in building a waveform synthesizer for the clustering
algorithm are as follows. A high level walkthrough of the scripts to run is given after
these lower level details.

• Collect the database of general speech.

• Build utterance structures for your database using the techniques discussed in
the Section called Utterance building in Chapter 3.

• Build coefficients for acoustic distances, typically some form of cepstrum plus
F0, or some pitch synchronous analysis (e.g. LPC).

• Build distance tables, precalculating the acoustic distance between each unit of
the same phone type.

• Dump selection features (phone context, prosodic, positional and whatever) for
each unit type.

• Build cluster trees using wagon with the features and acoustic distances dumped
by the previous two stages.

• Build the voice description itself.

Choosing the right unit type

Before you start you must make a decision about what unit type you are going
to use. Note there are two dimensions here. The first is size, such as phone, diphone,
demi-syllable. The second is the type itself, which may be simple phone, phone plus stress,
phone plus word etc. The code here and the related files basically assume the unit size is
phone. However, because you may also include a percentage of the previous unit in
the acoustic distance measure, this unit size is more effectively phone plus previous
phone, thus it is somewhat diphone like. The cluster method has no actual restrictions
on the unit size, it simply clusters the given acoustic units with the given features, but
the basic synthesis code is currently assuming phone sized units.

The second dimension, type, is very open and we expect that controlling this will be
a good method to attain high quality general unit selection synthesis. The parameter
clunit_name_feat may be used to define the unit type. The simplest conceptual
example is the one used in the limited domain synthesis. There we distinguish each
phone with the word it comes from, thus a d from the word limited is distinct from
the d in the word domain. Such distinctions can hard partition the space of phones
into types that can be more manageable.

The decision of how to carve up that space depends largely on the intended use
of the database. The more distinctions you make, the less you depend on the clustering
acoustic distance, but the more you depend on your labels (and the speech) being
(absolutely) correct. The mechanism to define the unit type is through a (typically) user
defined feature function. In the given setup scripts this feature function will be called
lisp_INST_LANG_NAME::clunit_name. Thus the voice simply defines the function
INST_LANG_NAME::clunit_name to return the unit type for the given segment. If you
wanted to make a diphone unit selection voice this function could simply be






(define (INST_LANG_NAME::clunit_name i)
  (string_append
   (item.name i)
   "_"
   (item.feat i "p.name")))



Thus the unit type would be the phone plus its previous phone. Note that the first
part of a unit name is assumed to be the phone name in various parts of the code,
thus although you may think it would be neater to return previousphone_phone
that would mess up some other parts of the code.

In the limited domain case the word is attached to the phone. You can also consider
some demi-syllable information or more to differentiate between different instances
of the same phone.

The important thing to remember is that at synthesis time the same function is called
to identify the unit type, which is used to select the appropriate cluster tree to select
from. Thus you need to ensure that if you use, say, diphones, your database
really does have all diphones in it.
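
The coverage requirement above can be checked before any clustering is done. The sketch
below is only an illustration in Python, not part of Festival or the festvox scripts: it
names units as phone plus previous phone (phone first, as the text requires) and reports
unit types that a target phone string needs but the database lacks. The helper names and
the "0" placeholder used when there is no previous phone are assumptions.

```python
from collections import Counter


def clunit_name(phone, prev_phone):
    # Phone name comes first, followed by the previous phone.
    return f"{phone}_{prev_phone}"


def unit_types(phones, start="0"):
    """Count unit types in a phone sequence; `start` stands in for "no previous phone"."""
    counts = Counter()
    prev = start
    for ph in phones:
        counts[clunit_name(ph, prev)] += 1
        prev = ph
    return counts


def missing_types(database_phones, target_phones):
    """Unit types required by the target that never occur in the database."""
    have = unit_types(database_phones)
    need = unit_types(target_phones)
    return sorted(t for t in need if t not in have)


database = ["pau", "hh", "ah", "l", "ow", "pau"]
target = ["pau", "l", "ow", "pau"]

if __name__ == "__main__":
    print(missing_types(database, target))
```

Here the target needs an l preceded by pau, which the toy database does not contain, so
that unit type is reported as missing.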

Collecting databases for unit selection

Unlike diphone databases, which are carefully constructed to ensure specific coverage,
one of the advantages of unit selection is that a much more general database is
desired. However, although voices may be built from existing data not specifically
gathered for synthesis, there are still factors about the data that will help make better
synthesis.

Like diphone databases, the more cleanly and carefully the speech is recorded the
better the synthesized voice will be. As we are going to be selecting units from different
parts of the database, the more similar the recordings are, the less likely bad
joins will occur. However, unlike diphone databases, prosodic variation is probably a
good thing, as it is those variations that can make synthesis from unit selection sound
more natural. Good phonetic coverage is also useful, at least phone coverage if not
complete diphone coverage. Also synthesis using these techniques seems to retain
aspects of the original database. If the database is broadcast news stories, the synthesis
from it will typically sound like read news stories (or more importantly will
sound best when it is reading news stories).



Although it is too early to make definitive statements about what size and type of
data is best for unit selection, we do have some rough guides. A Timit like database
of 460 phonetically balanced sentences (around 14,000 phones) is not an unreasonable
first choice. If the text has not been specifically selected for phonetic coverage a
larger database is probably required; for example the Boston University Radio News
Corpus speaker f2b [ostendorf95] has been used relatively successfully. Of course all
this depends on what use you wish to make of the synthesiser; if it's to be used in
more restrictive environments (as is often the case) tailoring the database for the task
is a very good idea. If you are going to be reading a lot of telephone numbers, having
a significant number of examples of read numbers will make synthesis of numbers
sound much better (see the following chapter on making such design more explicit).

The database used as an example here is a TIMIT 460 sentence database read by an
American male speaker.

Again the notes about recording the database apply, though it will sometimes be the
case that the database is already recorded and beyond your control; in that case you
will always have something legitimate to blame for poor quality synthesis.






Preliminaries 

Throughout our discussion we will assume the following database layout. It is highly
recommended that you follow this format, otherwise scripts and examples will fail.
There are many ways to organize databases and many such choices are arbitrary;
here is our "arbitrary" layout.

The basic database directory should contain the following directories 



bin/

Any database specific scripts for processing. Typically this first contains a copy
of standard scripts that are then customized when necessary to the particular
database.

wav/

The waveform files. These should be headered, one utterance per file, with a
standard name convention. They should have the extension .wav and the fileid
consistent with all other files through the database (labels, utterances, pitch
marks etc).

lab/

The segmental labels. This is usually the master label files; these may
contain more information than the labels used by festival, which will be in
festival/relations/Segment/.

lar/

The EGG files (laryngograph files) if collected.

festival/

Festival specific label files.

festival/relations/

The processed labeled files for building Festival utterances, held in directories
whose name reflects the relation they represent: Segment/, Word/,
Syllable/ etc.

festival/utts/

The utterance files as generated from the festival/relations/ label files.

pm/

Pitchmark files as generated from the lar files or from the signal directly.


Other directories will be created for various processing reasons.
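
Because the scripts rely on a consistent fileid across directories, a small consistency
check can catch missing files early. The following Python sketch is not part of festvox;
the function name and the .lab extension are assumptions for illustration (the .wav
extension is taken from the layout above).

```python
from pathlib import Path

# Directory -> file extension. Only .wav is stated in the layout above;
# the .lab and .pm extensions are assumed for this sketch.
EXPECTED = {"wav": ".wav", "lab": ".lab", "pm": ".pm"}


def missing_files(db_dir, expected=EXPECTED):
    """Return (directory, fileid) pairs for fileids found in wav/ but absent elsewhere."""
    db = Path(db_dir)
    fileids = sorted(p.stem for p in (db / "wav").glob("*.wav"))
    missing = []
    for directory, ext in expected.items():
        for fileid in fileids:
            if not (db / directory / (fileid + ext)).is_file():
                missing.append((directory, fileid))
    return missing


if __name__ == "__main__":
    # Run from the top of the database directory.
    for directory, fileid in missing_files("."):
        print("missing:", directory, fileid)
```

An empty result only says the files exist, not that their contents are correct.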

Building utterance structures for unit selection 

In order to make access well defined you need to construct Festival utterance structures
for each of the utterances in your database. This (in its basic form) requires labels
for: segments, syllables, words, phrases, F0 targets, and intonation events. Ideally
these should all be carefully hand labeled but in most cases that's impractical. There
are ways to automatically obtain most of these labels but you should be aware of the
inherent errors in the labeling system you use (including labeling systems that involve
human labelers). Note that when a unit selection method is to be used that fundamentally
uses segment boundaries, its quality is going to be ultimately determined
by the quality of the segmental labels in the databases.






For the unit selection algorithm described below the segmental labels should
use the same phoneset as used in the actual synthesis voice. However, a more
detailed phonetic labeling may be more useful (e.g. marking closures in stops),
mapping that information back to the phone labels before actual use. Autoaligned
databases typically aren't accurate enough for use in unit selection. Most
autoaligners are built using speech recognition technology where actual phone
boundaries are not the primary measure of success. General speech recognition
systems primarily measure words correct (or more usefully semantically correct)
and do not require phone boundaries to be accurate. If the database is to be used for
unit selection it is very important that the phone boundaries are accurate. Having
said this though, we have successfully used the aligner described in the diphone
chapter above to label general utterances where we knew which phone string we
were looking for; using such an aligner may be a useful first pass, but the result
should always be checked by hand.

It has been suggested that aligning techniques and unit selection training techniques
can be used to judge the accuracy of the labels and basically exclude any segments
that appear to fall outside the typical range for the segment type. Thus it is believed
that unit selection algorithms should be able to deal with a certain amount of noise in
the labeling. This is the desire for researchers in the field, but we are some way from
that, and the easiest way at present to improve the quality of unit selection algorithms
is to ensure that segmental labeling is as accurate as possible. Once we
have a better handle on selection techniques themselves it will then be possible to
start experimenting with noisy labeling.

However it should be added that this unit selection technique (and many others)
supports what is termed "optimal coupling" [conkie96], where the acoustically most
appropriate join point is found automatically at run time when two units are selected
for concatenation. This technique is inherently robust to boundary labeling errors of
at least a few tens of milliseconds.
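
To make the idea of optimal coupling concrete, here is a deliberately simplified Python
sketch. It is not the algorithm of [conkie96] nor Festival's implementation: it assumes
each unit is a list of coefficient frames, searches only the last few frames of the left
unit and the first few of the right unit, and uses a plain Euclidean frame distance. The
function names and the window size are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two coefficient frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_join(left, right, window=3):
    """Return (i, j, cost): frame i near the end of `left` best matches frame j near the start of `right`."""
    best = None
    for i in range(max(0, len(left) - window), len(left)):
        for j in range(min(window, len(right))):
            cost = frame_distance(left[i], right[j])
            if best is None or best[2] > cost:
                best = (i, j, cost)
    return best


def join_units(left, right, window=3):
    """Concatenate at the best join point, keeping the matched frame only once."""
    i, j, _ = best_join(left, right, window)
    return left[: i + 1] + right[j + 1:]


left = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
right = [[9.0, 9.0], [2.0, 2.1], [3.0, 3.0]]

if __name__ == "__main__":
    print(best_join(left, right))
    print(join_units(left, right))
```

In this toy case the labeled boundaries would join [5.0, 5.0] to [9.0, 9.0]; the search
instead moves both cut points one frame inwards, which is how small labeling errors can
be absorbed.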

For the cluster method defined here it is best to construct more than simply segments,
durations and an F0 target. A whole syllabic structure plus word boundaries, intonation
events and phrasing allow a much richer set of features to be used for clusters.
See the Section called Utterance building in Chapter 3 for a more general discussion of
how to build utterance structures for a database.



Making cepstrum parameter files

In order to cluster similar units in a database we build an acoustic representation
of them. This is also still a research issue, but in the example here we will use
Mel cepstrum. Interestingly we do not generate these at fixed intervals, but at pitch
marks. Thus we have a parametric spectral representation of each pitch period. We have
found this a better method, though it does require that pitchmarks are reasonably
identified.

Here is an example script which will generate these parameters for a database; it is
included in festvox/src/unitsel/make_mcep.

for i in $*
do
   fname=`basename $i .wav`
   echo $fname MCEP
   $SIG2FV $SIG2FVPARAMS -otype est_binary $i -o mcep/$fname.mcep -pm pm/$fname.pm -window_type hamming
done



The above builds coefficients at fixed frames. We have also experimented with building
parameters pitch synchronously and have found a slight improvement in the
usefulness of the measure based on this. We do not pretend that this part is particularly
neat in the system but it does work. When pitch synchronous parameters
are built the clunits module will automatically put the local F0 value in coefficient
0 at load time. This happens to be appropriate for LPC coefficients. The script in
festvox/src/general/make_lpc can be used to generate the parameters, assuming
you have already generated pitch marks.

Note the secondary advantage of using LPC coefficients is that they are required
anyway for LPC resynthesis, thus this allows less information about the database
to be required at run time. We have not yet tried pitch synchronous Mel frequency
cepstrum coefficients but that should be tried. Also a more general duration/number
of pitch periods match algorithm is worth defining.



Building the clusters

Cluster building is mostly automatic. Of course you need the clunits modules compiled
into your version of Festival. Version 1.3.1 or later is required; the version of
clunits in 1.3.0 is buggy and incomplete and will not work. To compile in clunits,
add

ALSO_INCLUDE += clunits

to the end of your festival/config/config file, and recompile. To check if an
installation already has support for clunits check the value of the variable *modules*.

The file festvox/src/unitsel/build_clunits.scm contains the basic parameters
to build a cluster model for a database that has utterance structures and acoustic
parameters. The function build_clunits will build the distance tables, dump
the features and build the cluster trees. There are many parameters that are set for the
particular database (and instance of cluster building) through the Lisp variable
clunits_params. A reasonable set of defaults is given in that file, and reasonable
run-time parameters will be copied into festvox/INST_LANG_VOX_clunits.scm
when a new voice is set up.

The function build_clunits runs through all the steps, but in order to better explain
what is going on, we will go through each step and at that time explain which
parameters affect the substep.

The first stage is to load in all the utterances in the database, sort them into segment
type and name them with individual names (as TYPE_NUM). This first stage is required
for all other stages, so if you are not running build_clunits you still need to
run this stage first. This is done by the calls

(format t "Loading utterances and sorting types\n")

(set! utterances (acost:db_utts_load dt_params)) 
(set! unittypes (acost:find_same_types utterances)) 
(acost:name_units unittypes) 

Though the function build_clunits_init will do the same thing.
This uses the following parameters

name STRING

A name for this database.

db_dir FILENAME

The pathname of the database, typically . as in the current directory.

utts_dir FILENAME

The directory containing the utterances.

utts_ext FILENAME

The file extension for the utterance files.

files

The list of file ids in the database.

For example for the KED example these parameters are 

(name 'ked_timit)
(db_dir "/usr/awb/data/timit/ked/")
(utts_dir "festival/utts/")
(utts_ext ".utt")
(files ("kdt_001" "kdt_002" "kdt_003" ... ))

In the examples below the list of fileids is extracted from the given prompt file at
call time.

The next stage is to load the acoustic parameters and build the distance tables. The
acoustic distance between each segment of the same type is calculated and saved in
the distance table. Precalculating this saves a lot of time as the cluster will require
this number many times.

This is done by the following two function calls

(format t "Loading coefficients\n")
(acost:utts_load_coeffs utterances)
(format t "Building distance tables\n")
(acost:build_disttabs unittypes clunits_params)

The following parameters influence the behaviour.

coeffs_dir FILENAME

The directory (from db_dir) that contains the acoustic coefficients as generated
by the script make_mcep.

coeffs_ext FILENAME

The file extension for the coefficient files.

get_stds_per_unit

Takes the value t or nil. If t the parameters for the type of segment are normalized
using the means and standard deviations for the class. Thus a
mean Mahalanobis Euclidean distance is found between units rather than simply
a Euclidean distance. The recommended value is t.

ac_left_context FLOAT

The amount of the previous unit to be included in the distance. 1.0 means
all, 0.0 means none. This parameter may be used to make the acoustic distance
sensitive to the previous acoustic context. The recommended value is 0.8.

dur_pen_weight FLOAT

The penalty factor for duration mismatch between units.

f0_pen_weight FLOAT

The penalty factor for F0 mismatch between units.

ac_weights (FLOAT FLOAT ...)

The weights for each parameter in the coefficient files used while finding the
acoustic distance between segments. There must be the same number of weights
as there are parameters in the coefficient files. The first parameter is (in normal
operations) F0. It is common to give proportionally more weight to F0 than to
each individual other parameter. The remaining parameters are typically MFCCs
(and possibly delta MFCCs). Finding the right parameters and weightings is one of
the key goals in unit selection synthesis, so it's not easy to give concrete
recommendations. The following aren't bad, but there may be better ones too, though
we suspect that real human listening tests are probably the best way to find
better values.



An example

(coeffs_dir ...)
(coeffs_ext ...)
(dur_pen_weight ...)
(get_stds_per_unit ...)
(ac_left_context ...)
(ac_weights
 (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
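
The way these parameters interact can be sketched in a few lines of Python. This is a
simplified stand-in, not the clunits implementation: it assumes each unit is a list of
frames (coefficient 0 being F0), compares frames by index up to the shorter unit rather
than aligning them, and adds a relative duration penalty. The function names, the toy
data and the dur_pen_weight value of 0.1 are all hypothetical.

```python
import math


def dimension_stds(units):
    """Per-coefficient standard deviation over all frames of one unit type."""
    frames = [frame for unit in units for frame in unit]
    stds = []
    for d in range(len(frames[0])):
        mean = sum(f[d] for f in frames) / len(frames)
        var = sum((f[d] - mean) ** 2 for f in frames) / len(frames)
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 to avoid dividing by zero
    return stds


def unit_distance(a, b, weights, stds, dur_pen_weight):
    """Weighted, std-normalized mean frame distance plus a duration mismatch penalty."""
    n = min(len(a), len(b))
    total = 0.0
    for k in range(n):
        for w, s, x, y in zip(weights, stds, a[k], b[k]):
            total += w * abs(x - y) / s
    dur_penalty = dur_pen_weight * abs(len(a) - len(b)) / max(len(a), len(b))
    return total / n + dur_penalty


def distance_table(units, weights, dur_pen_weight=0.1):
    """Precompute the distance between every pair of units of the same type."""
    stds = dimension_stds(units)
    return [[unit_distance(a, b, weights, stds, dur_pen_weight) for b in units]
            for a in units]


units = [
    [[100.0, 1.0], [110.0, 1.2]],
    [[100.0, 1.0], [110.0, 1.2]],
    [[150.0, 3.0]],
]
weights = [1.0, 0.5]  # more weight on F0 than on the other coefficient

if __name__ == "__main__":
    for row in distance_table(units, weights):
        print([round(v, 3) for v in row])
```

Normalizing by the standard deviation plays the role described for get_stds_per_unit:
without it, F0 in Hz would swamp the much smaller cepstral values regardless of the
weights.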

The next stage is to dump the features that will be used to index the clusters.
Remember the clusters are defined with respect to the acoustic distance between each
unit in the cluster, but they are indexed by these features. These features are those
which will be available at text-to-speech time when no acoustic information is available.
Thus they include things like phonetic and prosodic context rather than spectral
information. The named features may (and probably should) be over general, allowing
the decision tree building program wagon to decide which of these features actually
does have an acoustic distinction in the units.

The function to dump the features is

(format t "Dumping features for clustering\n")
(acost:dump_features unittypes utterances clunits_params)

The parameters which affect this function are

feats_dir FILENAME

The directory where the features will be saved (by segment type).

feats LIST

The list of features to be dumped. These are standard festival feature names with
respect to the Segment relation.

For our KED example these values are 



(feats_dir "festival/feats/")
(feats
 (occurid
  p.name p.ph_vc p.ph_ctype
  p.ph_vheight p.ph_vlng
  p.ph_vfront p.ph_vrnd
  p.ph_cplace p.ph_cvox
  n.name n.ph_vc n.ph_ctype
  n.ph_vheight n.ph_vlng
  n.ph_vfront n.ph_vrnd
  n.ph_cplace n.ph_cvox
  segment_duration
  seg_pitch p.seg_pitch n.seg_pitch
  R:SylStructure.parent.stress
  seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
  R:SylStructure.parent.accented
  pos_in_syl
  syl_initial
  syl_final
  R:SylStructure.parent.syl_break
  R:SylStructure.parent.R:Syllable.p.syl_break
  pp.name pp.ph_vc pp.ph_ctype
  pp.ph_vheight pp.ph_vlng
  pp.ph_vfront pp.ph_vrnd
  pp.ph_cplace pp.ph_cvox))




Now that we have the acoustic distances and the feature descriptions of each unit, the
next stage is to find a relationship between those features and the acoustic distances.
This we do using the CART tree builder wagon. It will find out questions about which
features best minimize the acoustic distance between the units in that class. wagon has
many options, many of which are apposite to this task, though it is interesting that this
learning task is closed. That is, we are trying to classify all the units in
the database; there is no test set as such. However in synthesis there will be desired
units whose feature vectors didn't exist in the training set.
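
The kind of search a tree builder performs at each node can be illustrated with a toy
example. The Python sketch below is not wagon and does not use wagon's actual impurity
measure: it simply scores each candidate yes/no question by the size-weighted mean
pairwise acoustic distance of the two resulting groups and picks the lowest. The function
names, the toy units and the min_size argument (playing a role similar to a minimum
cluster size) are assumptions.

```python
def mean_pairwise(indices, dist):
    """Mean acoustic distance between all pairs of units in a candidate cluster."""
    if 2 > len(indices):
        return 0.0
    pairs = [(i, j) for k, i in enumerate(indices) for j in indices[k + 1:]]
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def best_question(units, dist, questions, min_size=1):
    """Return (score, feature, value) for the split with the lowest weighted impurity."""
    best = None
    everything = list(range(len(units)))
    for feat, value in questions:
        yes = [i for i in everything if units[i][feat] == value]
        no = [i for i in everything if units[i][feat] != value]
        if min_size > len(yes) or min_size > len(no):
            continue  # split would create a cluster that is too small
        score = (len(yes) * mean_pairwise(yes, dist)
                 + len(no) * mean_pairwise(no, dist)) / len(everything)
        if best is None or best[0] > score:
            best = (score, feat, value)
    return best


units = [
    {"p.name": "s", "stress": "1"},
    {"p.name": "s", "stress": "0"},
    {"p.name": "t", "stress": "1"},
    {"p.name": "t", "stress": "0"},
]
dist = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
questions = [("p.name", "s"), ("stress", "1")]

if __name__ == "__main__":
    print(best_question(units, dist, questions))
```

In this toy data the previous phone explains the acoustic grouping and stress does not,
so the question about p.name wins; over general feature lists are handled the same way,
with uninformative features simply never being chosen.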



The clusters are built by the following function

(format t "Building cluster trees\n")
(acost:find_clusters (mapcar car unittypes) clunits_params)

The parameters that affect the tree building process are

trees_dir FILENAME

The directory where the decision tree for each segment type will be saved.

wagon_field_desc LIST

A filename of a wagon field descriptor file. This is a standard field description
(field name plus field type) that is required for wagon. An example is given in
festival/clunits/all.desc which should be sufficient for the default feature
list, though if you change the feature list (or the values those features can take)
you may need to change this file.

wagon_progname FILENAME

The pathname for the wagon CART building program. This is a string and may
also include any extra parameters you wish to give to wagon.

wagon_cluster_size INT

The minimum cluster size (the wagon -stop value).

prune_reduce INT

The number of elements in each cluster to remove in pruning. This removes
the units in the cluster that are furthest from the center. This is done within the
wagon training.

cluster_prune_limit INT

This is a post wagon build operation on the generated trees (and perhaps a more
reliable method of pruning). This defines the maximum number of units that
will be in a cluster at a tree leaf; the wagon cluster size is the minimum size. This is
useful when there are large numbers of some particular unit type which
cannot be differentiated, for example silence segments whose context is
nothing other than silence. Another usage of this is to cause only the center example
units to be used. We have used this in building diphone databases from general
databases by making the selection features only include phonetic context features
and then restricting the number of diphones we take by making this number
5 or so.

unittype_prune_threshold INT

When making complex unit types this defines the minimal number of units of
that type required before building a tree. When doing cascaded unit selection
synthesizers it's often not worth excluding large stages if there is, say, only one
example of a particular demi-syllable.
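The two pruning parameters above (prune_reduce and cluster_prune_limit) can be pictured with a small sketch. The Python below is only an illustration and is not Festival's code: the function names, the dictionary layout of the distances, and the use of "mean distance to the other units" as the measure of closeness to the center are all assumptions made for this example.

```python
def rank_by_centrality(units, dist):
    """Order units from most central to least central.

    Centrality is taken to be the mean acoustic distance to the other
    units in the cluster (an assumption made for this sketch).
    """
    def mean_dist(u):
        others = [v for v in units if v != u]
        if not others:
            return 0.0
        return sum(dist[u][v] for v in others) / len(others)
    return sorted(units, key=mean_dist)


def prune_reduce(units, dist, n_remove):
    """Remove the n_remove units furthest from the center (keep at least one)."""
    ranked = rank_by_centrality(units, dist)
    keep = max(1, len(ranked) - max(0, n_remove))
    return ranked[:keep]


def cluster_prune_limit(units, dist, max_size):
    """Keep at most max_size units, preferring the most central ones."""
    return rank_by_centrality(units, dist)[:max_size]


# Toy symmetric distance table for four hypothetical units of one type.
dist = {
    "a_1": {"a_2": 1.0, "a_3": 1.5, "a_4": 9.0},
    "a_2": {"a_1": 1.0, "a_3": 1.2, "a_4": 8.5},
    "a_3": {"a_1": 1.5, "a_2": 1.2, "a_4": 9.5},
    "a_4": {"a_1": 9.0, "a_2": 8.5, "a_3": 9.5},
}
units = ["a_1", "a_2", "a_3", "a_4"]
print(prune_reduce(units, dist, 1))
print(cluster_prune_limit(units, dist, 2))
```

In this toy table the outlier a_4 is the first unit to be dropped, which is the behaviour the parameter descriptions suggest.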



Note that as the distance tables can be large there is an alternative function that does
both the distance table and clustering in one, deleting the distance table immediately
after use, thus you only need enough disk space for the largest number of phones in
any type. To do this, call

(acost:disttabs_and_clusters unittypes clunits_params)

in place of the calls to acost:build_disttabs and acost:find_clusters.
In our KED example these have the values

(trees_dir "festival/trees/")
(wagon_field_desc "festival/clunits/all.desc")
(wagon_progname "/usr/awb/projects/speech_tools/bin/wagon")
(wagon_cluster_size 10)
(prune_reduce 0)


The final stage in building a cluster model is to collect the generated trees into a single
file and dump the unit catalogue, i.e. the list of unit names and their files and
position in them. This is done by the lisp function

(acost:collect_trees (mapcar car unittypes) clunits_params)
(format t "Saving unit catalogue\n")
(acost:save_catalogue utterances clunits_params)


The only parameter that affects this is

catalogue_dir FILENAME

The directory where the catalogue will be saved (the name parameter is used to
name the file).

By default this is

(catalogue_dir "festival/clunits/")


There are a number of parameters that are specified with a cluster voice. These are 
related to the run time aspects of the cluster model. These are 




join_weights FLOATLIST

This is a set of weights, in the same format as ac_weights, that are used in
optimal coupling to find the best join point between two candidate units. This is
different from ac_weights as it is likely different values are desired, particularly
increasing the F0 value (column 0).

continuity_weight FLOAT

The factor to multiply the join cost over the target cost. This is probably not very
relevant given that the target cost is merely the position from the cluster center.

log_scores 1

If specified the join scores are converted to logs. For databases that have a
tendency to contain non-optimal joins (probably any non-limited domain
databases), this may be useful to stop failed synthesis of longer sentences. The
problem is that the sum of very large numbers can lead to overflow. This helps
reduce this. You could alternatively change the continuity_weight to a number
less than 1 which would also partially help. However such overflows are often a
pointer to some other problem (poor distribution of phones in the db), so this is
probably just a hack.

optimal_coupling INT

If 1 this uses optimal coupling and searches the cepstrum vectors at each join
point to find the best possible join point. This is computationally expensive (as
well as having to load in lots of cepstrum files), but does give better results. If
the value is 2 this only checks the coupling distance at the given boundary (and
doesn't move it). This is often adequate in good databases (e.g. limited domain),
and is certainly faster.

extend_selections INT

If 1 then the selected cluster will be extended to include any unit from the cluster
of the previous segment's candidate units that has the correct phone type (and isn't
already included in the current cluster). This is experimental but has shown its
worth and hence is recommended. This means that instead of selecting just units,
selection is effectively selecting the beginnings of multiple segment units. This
option encourages far longer units.

pm_coeffs_dir FILENAME

The directory (from db_dir) where the pitchmarks are.

pm_coeffs_ext FILENAME

The file extension for the pitchmark files.

sig_dir FILENAME

Directory containing waveforms of the units (or residuals if residual LPC is
being used, PCM waveforms if PSOLA is being used).

sig_ext FILENAME

File extension for waveforms/residuals.

join_method METHOD

Specify the method used for joining the selected units. Currently it supports
simple, a very naive joining mechanism, and windowed, where the ends of the
units are windowed using a Hamming window then overlapped (no prosodic
modification takes place though). The other two possible values for this
feature are none, which does nothing, and modified_lpc, which uses the standard
UniSyn module to modify the selected units to match the targets.




clunits_debug 1/2 

With a value of 1 some debugging information is printed during synthesis,
particularly how many candidate phones are available at each stage (and any
extended ones). Also where each phone is coming from is printed.

With a value of 2 more debugging information is given, including the above plus
joining costs (which are very readable by humans).
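To make the join-related parameters a little more concrete, here is a minimal Python sketch of searching for a join point with a weighted distance. It is not Festival's implementation: the function names, the weighted Euclidean form of the distance, and the idea of searching every pairing of a few frames on each side of the boundary are assumptions made for illustration. The first function mimics checking the distance only at the given boundary; the second searches nearby frames for a cheaper join.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two parameter vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def boundary_cost(left_frames, right_frames, weights):
    """Cost of joining exactly at the labeled boundary (no movement)."""
    return weighted_distance(left_frames[-1], right_frames[0], weights)


def best_join(left_frames, right_frames, weights):
    """Search all frame pairs and return (cost, i, j) for the cheapest join.

    i indexes into left_frames, j into right_frames.
    """
    best = None
    for i, lf in enumerate(left_frames):
        for j, rf in enumerate(right_frames):
            cost = weighted_distance(lf, rf, weights)
            if best is None or cost < best[0]:
                best = (cost, i, j)
    return best


# Hypothetical two-dimensional frames: column 0 stands in for F0.
left = [[1.0, 0.0], [2.0, 0.0], [3.5, 0.0]]
right = [[2.1, 0.0], [5.0, 0.0]]
weights = [1.0, 1.0]
print(boundary_cost(left, right, weights))
print(best_join(left, right, weights))
```

The exhaustive search touches every frame pair, which gives a feel for why the searching variant is described as computationally expensive compared with only checking the boundary.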




Building a Unit Selection Cluster Voice

The previous section gives the low level details of the building of a cluster unit
selection voice. This section gives a higher level view with explicit commands that you
should run. The steps involved in building a unit selection voice are basically the
same as those for building a limited domain voice (Chapter 5). For general
voices, however, in contrast to ldom voices, it is much more important to get all parts
correct, from pitchmarks to labeling.

The following tasks are required:

• Read and understand all the issues regarding the following steps

• Design the prompts

• Record the prompts

• Autolabel the prompts

• Build utterance structures for recorded utterances

• Extract pitchmarks and build LPC coefficients

• Build a clunit based synthesizer from the utterances

• Testing and tuning

The following are the commands that you must type (assuming all the other
hard work has been done beforehand). It is assumed that the environment variables
FESTVOXDIR and ESTDIR have been set to point to their respective directories. For
example as

export FESTVOXDIR=/home/awb/projects/festvox
export ESTDIR=/home/awb/projects/1.4.3/speech_tools

Next you must select a name for the voice. By convention we use three part names
consisting of an institution name, a language, and a speaker. Make a directory of that
name and change directory into it

mkdir cmu_us_awb 
cd cmu_us_awb 

There is a basic setup script that will construct the directory structure and copy in
the template files for voice building. If a fourth argument is given, it can name one
of the standard prompt lists.

For example the simplest is uniphone. This contains three sentences which contain
each of the US English phonemes once (if spoken appropriately). This prompt set
is hopelessly minimal for any high quality synthesis but allows us to illustrate the
process and allows you to build a voice quickly.



$FESTVOXDIR/src/unitsel/setup_clunits cmu us awb uniphone

Alternatively you can copy a prompt list into the etc directory. The format of these
should be the standard "data" format as in

( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )

Note the spaces after the initial left parenthesis are significant, and double quotes
and backslashes within the quote part must be escaped (with backslash) as is common
in Perl or Festival itself.
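If you generate a prompt list from your own text, the escaping rule just described is easy to get wrong by hand. The following Python sketch writes lines in the "data" format shown above; the function name and the example file ids are hypothetical, and the only rules applied are the ones stated in the text (a space after the left parenthesis, backslash-escaped double quotes and backslashes).

```python
def format_prompt(fileid, text):
    """Return one prompt line in the "data" format.

    Backslashes are escaped first so that the backslashes added for
    double quotes are not escaped a second time.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '( %s "%s" )' % (fileid, escaped)


print(format_prompt("uniph_0001", "a whole joy was reaping."))
print(format_prompt("demo_0001", 'she said "hello" twice.'))
```

The order of the two replacements matters: escaping quotes first and backslashes second would double the backslashes that were just inserted.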

The next stage is to generate waveforms to act as prompts, or timing cues even if the
prompts are not actually played. The files are also used in aligning the spoken data.

festival -b festvox/build_clunits.scm '(build_prompts "etc/uniphone.data")'

Use whatever prompt file you are intending to use. Note that you may want to add
lexical entries to festvox/WHATEVER_lexicon.scm and other text analysis things as
desired. The purpose is that the prompt files match the phonemes that the voice talent
will actually say.

You may now record, assuming you have prepared the recording studio, gotten written
permission to record your speaker (and explained to them what the resulting
voice might be used for), checked recording levels and sound levels and shielded the
electrical equipment as much as possible.

./bin/prompt_them etc/uniphone.data

After recording, the recorded files should be in wav/. It is wise to check that they are
actually there and sound like you expected. Getting the recording quality as high as
possible is fundamental to the success of building a voice.



Now we must label the spoken prompts. We do this by matching the synthesized
prompts with the spoken ones. As we know where the phonemes begin and end in
the synthesized prompts we can then map that onto the spoken ones and find the
phoneme segments. This technique works fairly well, but it is far from perfect and it
is worthwhile to at least check the result, and most probably fix the result by hand.

./bin/make_labs prompt-wav/*.wav

Especially in the case of the uniphone synthesizer, where there is one and only
one occurrence of each phone, they all must be correct so it's important to check the
labels by hand. Note for large collections you may find the full Sphinx based labeling
technique better (see the Section called Labeling with Full Acoustic Models in Chapter 14).

After labeling we can build the utterance structure using the prompt list and the now
labeled phones and durations.

festival -b festvox/build_clunits.scm '(build_utts "etc/uniphone.data")'

The next stages are concerned with signal analysis, specifically pitch marking and
cepstral parameter extraction. There are a number of methods for pitch mark extraction
and a number of parameters within these files that may need tuning. Good pitch periods
are important. See the Section called Extracting pitchmarks from waveforms in Chapter 4.
In its simplest case the following may work






./bin/make_pm_wave wav/*.wav 



The next stage is to find the Mel Frequency Cepstral Coefficients. This is done pitch
synchronously and hence depends on the pitch periods extracted above. These are used
for clustering and for join measurements.

./bin/make_mcep wav/*.wav 



Now we can do the main part of the build, building the cluster unit selection
synthesizer. This consists of a number of stages, all based on the controlling Festival
script, the parameters of which are described above.

festival -b festvox/build_clunits.scm '(build_clunits "etc/uniphone.data")'

For large databases this can take some time to run as there is a squared aspect to this
based on the number of instances of each unit type.
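The squared aspect can be seen with a little arithmetic: if every unit of a type is compared with every other unit of that type, the number of comparisons grows with the square of the number of instances. The Python below is only a sketch; the instance counts are invented, and counting each unordered pair once is an assumption about how the distances are tabulated.

```python
def pair_counts(unit_counts):
    """Number of unordered unit pairs per unit type: n * (n - 1) / 2."""
    return {name: n * (n - 1) // 2 for name, n in unit_counts.items()}


counts = {"aa": 100, "pau": 2000}  # hypothetical instance counts
for name, pairs in sorted(pair_counts(counts).items()):
    print(name, pairs)
```

With these made-up counts a unit type with twenty times as many instances needs roughly four hundred times as many comparisons, which is why very frequent types such as silence dominate the build time.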


Diphones from general databases 

As touched on above, the choice of an inventory of units can be viewed as a line from
a small inventory of phones, to diphones, triphones, to arbitrary units, though the
direction you come from influences the selection of the units from the database. CHATR
[campbell96] lies firmly at the "arbitrary units" end of the spectrum. Although it can
exclude bad units from its inventory it is very much an "everything minus some" view of
the world. Microsoft's Whistler [huang97], on the other hand, starts off with a general
database but selects typical units from it. Thus its inventory is substantially
smaller than the full general database that units are extracted from. At the other end
of the spectrum we have the fixed pre-specified inventory like diphone synthesis as
has been described in the previous chapter.

In this section we'll give some examples of moving along the line from the fixed
pre-specified inventory towards the more general inventories, but these techniques
still have a strong component of prespecification.

Firstly let us assume you have a general database that is labeled with utterances
as described above. We can extract a standard diphone database from this general
database; however unless the database was specifically designed, a general database
is unlikely to have full diphone coverage. Even when phonetically rich databases are
used, such as TIMIT, there are likely to be very few vowel-vowel diphones as they are
comparatively rare. But as these diphones are rare we may be able to do without them
and hence it is at least an interesting exercise to extract an as complete as possible
diphone index from a general database.

The simplest method is to linearly search for all phone-phone pairs in the phone
set through all utterances, simply taking the first example. Some sample code is given
in src/diphone/make_diphs_index.scm. The basic idea is to load in all the utterances
in a database, and index each segment by its phone name and succeeding phone
name. Then various selection techniques can be used to select from the multiple
candidates of each diphone (or you can split the indexing further). After selection a
diphone index file can be saved.
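The "first example wins" indexing idea can be sketched outside Festival in a few lines of Python. This is not the code in make_diphs_index.scm: the function names, the plain list-of-phones representation of an utterance, and the "p1-p2" naming of diphones are assumptions made for illustration, and a real index would also record times within the file.

```python
def make_diphone_index(utterances):
    """Map each diphone name to the first (fileid, position) where it occurs.

    utterances is a list of (fileid, [phone, phone, ...]) pairs.
    """
    index = {}
    for fileid, phones in utterances:
        for pos in range(len(phones) - 1):
            name = phones[pos] + "-" + phones[pos + 1]
            index.setdefault(name, (fileid, pos))  # keep the first example only
    return index


def missing_diphones(index, phoneset):
    """List the phone-phone pairs that have no example in the index."""
    return [a + "-" + b for a in phoneset for b in phoneset
            if a + "-" + b not in index]


utts = [
    ("utt_0001", ["pau", "h", "e", "l", "ou", "pau"]),
    ("utt_0002", ["pau", "l", "ou", "pau"]),
]
index = make_diphone_index(utts)
print(index["l-ou"])
print(len(missing_diphones(index, ["pau", "h", "e", "l", "ou"])))
```

Listing the missing pairs alongside the index gives a quick measure of how far a general database is from full diphone coverage.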



The utterances to load are identified by a list of fileids. For example if the list of fileids
(without parentheses) is in the file etc/fileids, the following will build a diphone
index.



festival .../make_diphs_utts.scm 






festival> (set! fileids (load "etc/fileids" t)) 

festival> (make_diphone_index fileids "dic/f2bdiph.est") 



Note that as this diphone index will contain a number of holes you will need to 
either augment it with "similar" diphones or process your diphone selections through 
UniSyn_module_hooks as described in the previous chapter. 

As you complicate the selection, and the number of diphones you use from the
database, you will need to complicate the names used to identify the diphones
themselves. The convention of using underscores for syllable internal consonant clusters
and dollars for syllable initial consonants can be followed, but you will need to go
further if you wish to start introducing new features such as phrase finality and stress.
Eventually going to a generalized naming scheme (type and number) as used by
the cluster selection technique described above will prove worthwhile. Also using
CART trees, though hand written and fully deterministic (one candidate at the
leafs), will be a reasonable algorithm to select between hand stipulated alternatives
with reasonable backoff strategies.

Another potential direction is to use the acoustic costs used in the clustering methods
described in the previous section. These can be used to identify what the most typical
units in a cluster are (the mean distances from all other units are given in the leafs).
Pruning these trees until each cluster only contains a single example should help to
improve synthesis, in that variation in the features in the "diphone" index will then be
determined by the features specified in the cluster training algorithm. Of course, the more
you limit the number of distinct unit types the more prosodic modification will
be required by your signal processing algorithm, which requires that you have good
pitch marks.

If you already have an existing database but don't wish to go to full unit selection,
such techniques are probably quite feasible and worth further investigation.








Chapter 13. Statistical Parametric Synthesis 



Building a CLUSTERGEN Statistical Parametric Synthesizer 

This method, inspired by the work of Keiichi Tokuda and NITECH's HMM Speech
Synthesis Toolkit, is a method for building statistical parametric synthesizers from
databases of natural speech. Although the result is still not as crisp as a well done
unit selection voice, with this method it is much easier to get a nice clear synthetic
voice that models the original speaker well.

Although this method is partially "tagged on to" the clunits method, it is actually
quite independent. The tasks are as follows.

• Read and understand all the issues regarding the following steps

• Set up the directory structure

• Record or import the prompts and prompt list

• Label the data with the HMM-state sized segments

• Build utterance structure for recorded utterances

• Extract F0, voicing and mcep coefficients

• Build a CLUSTERGEN voice

• Build an HMM-state duration model

• Testing



We assume you have read the rest of this chapter (though, in reality, we know you
probably haven't), thus the descriptions here are quite minimal.

First make an empty directory and in it run the setup_cg setup command.

mkdir cmu_us_awb_arctic
cd cmu_us_awb_arctic
$FESTVOXDIR/src/clustergen/setup_cg cmu us awb_arctic

If you already have an existing voice, running setup_cg will only copy in the
necessary files for clustergen; however I recommend starting from scratch as I don't know
when you created your previous voice and I'm not sure of its exact state.

Now you need to get your waveform files and prompt file. Put your waveform files
in wav/ and your prompt file in etc/txt.done.data. Note you should probably
use bin/get_wavs to copy the wavefiles so that they get power normalized and get
changed to a reasonable format (16KHz, 16bit, RIFF format).

If you are going to record them in your current directory, you should call

./bin/do_build build_prompts
first to generate example waveforms, then use 

./bin/prompt_them etc/txt.done.data 1 

to prompt you and record the prompts. You must check that the recording actually
works. It should generate recordings in wav/. You can use $ESTDIR/bin/na_play
to play the waveform files. prompt_them can be stopped with ctrl-c and restarted at
the line number given as the second argument.

The next stage is to label the data. If you aren't very knowledgeable about labeling
in clustergen, you should use the EHMM labeler. EHMM constructs the labels in the
right format for segments and HMM states, and matches them properly with what
the synthesizer generates for the prompts. Using other labels is likely to cause more
problems. Even if you already have other labels use EHMM first.

./bin/do_build build_prompts
./bin/do_build label
./bin/do_build build_utts

The EHMM labeler has been shown to be very reliable, and can nicely deal with
silence insertion. It isn't very fast though and will take several hours. You can check
the file ehmm/mod/log100.txt to see the Baum-Welch iterations; there will probably
be 20-30. An ARCTIC a-set takes about 3-4 hours to label.



Parametric synthesis requires a reversible parameterization. The setup here uses a
form of mel cepstrum, the same version that is used by NITECH's basic HTS build.
Parameter building is in two parts: building the F0 and building the mceps themselves.
Then these are combined into a single parameter file for each utterance in the
database.

./bin/do_clustergen f0
./bin/do_clustergen mcep
./bin/do_clustergen voicing
./bin/do_clustergen combine_coeffs_v

The mcep part takes the longest. Note that the F0 part now tries to estimate the range
of the F0 of the speaker and modifies parameters for the F0 extraction program. (The
F0 params are saved in etc/f0.params.)

If you want to have a test set of utterances, you can separate out some of your prompt
list. The test set should be put in the file etc/txt.done.data.test. The following
commands will make a training and test set (every 10th prompt in the test set, the other 9
in the training set).

./bin/traintest etc/txt.done.data
cat etc/txt.done.data.train >etc/txt.done.data
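The effect of such a split can be pictured with a short Python sketch. This is not the traintest script itself: the function name is hypothetical, and which of each ten prompts goes to the test set (here the 10th, 20th, and so on) is an assumption.

```python
def split_train_test(prompts, every=10):
    """Send every `every`-th prompt to the test set and the rest to training."""
    train, test = [], []
    for n, line in enumerate(prompts, start=1):
        if n % every == 0:
            test.append(line)
        else:
            train.append(line)
    return train, test


# Twenty made-up prompt lines standing in for etc/txt.done.data.
prompts = ['( demo_%04d "prompt %d" )' % (i, i) for i in range(1, 21)]
train, test = split_train_test(prompts)
print(len(train), len(test))
```

Taking every 10th prompt rather than the last tenth of the list keeps the test set spread across the whole recording session.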



The next stage is to build the parametric model. Three parts are required for this.
The first is very quick and simply puts the state (and phone) names
into their respective files. It assumes a file etc/statenames which is generated by
EHMM. The second stage builds the parametric model itself. The last builds a
duration model for the state names.

./bin/do_clustergen generate_statenames
./bin/do_clustergen cluster
./bin/do_clustergen dur

The resulting voice should now work

festival festvox/cmu_us_awb_arctic_cg.scm

festival> (voice_cmu_us_awb_arctic_cg)
festival> (SayText "This is a little example.")



The voice can be packaged for distribution by the command 



./bin/do_clustergen festvox_dist



This will generate festvox_cmu_us_awb_arctic_cg.tar.gz which will be quite
small compared to a clunit voice made with the same databases. Because only the
parameters are kept (in fact only means and standard deviations of clusters of
parameters), which do not include residual or excitation information, the result is
orders of magnitude smaller than a full unit selection voice.

There are two other options in the clustergen voice build. These involve modeling
trajectories rather than individual vectors. They give objectively better results (though
only marginally better subjective results for the voices we have tested). Instead of the line

./bin/do_clustergen cluster

you can run

./bin/do_clustergen trajectory

or the slightly better

./bin/do_clustergen trajectory_ola

These two options may be run after the simple version of the voice has been built.

You can test your voice with held out data, if you did this in the above step that
created etc/txt.done.data.test. You can run

$FESTVOXDIR/src/clustergen/cg_test resynth cgp

NOTE: This no longer works automatically, as you need static mceps and ccoefs for
this to work. This will create parameter files (and waveform files) in test/cgp. The
output of cg_test also gives four measures: the mean difference for all features in the
parameter vector, for F0 alone, for all but F0, and MCD (mel cepstral distortion).
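For readers unfamiliar with MCD, the measure is commonly defined per frame as (10 / ln 10) times the square root of twice the summed squared differences between the two sets of mel cepstral coefficients, averaged over frames. The Python below is a sketch of that common definition only; whether cg_test uses exactly this form, and whether it skips coefficient 0 (energy) as done here, are assumptions, as are the function name and the frame layout.

```python
import math


def mel_cepstral_distortion(ref_frames, test_frames):
    """Mean MCD in dB over time-aligned frames.

    Each frame is a list of mel cepstral coefficients; coefficient 0
    is skipped in this sketch. Frames are paired in order, and any
    extra frames in the longer sequence are ignored.
    """
    const = 10.0 / math.log(10.0)
    n = min(len(ref_frames), len(test_frames))
    if n == 0:
        raise ValueError("no frames to compare")
    total = 0.0
    for ref, test in zip(ref_frames[:n], test_frames[:n]):
        sq = sum((r - t) ** 2 for r, t in zip(ref[1:], test[1:]))
        total += const * math.sqrt(2.0 * sq)
    return total / n


ref = [[5.0, 1.0, 0.0], [5.0, 0.5, 0.2]]
same = [[4.0, 1.0, 0.0], [6.0, 0.5, 0.2]]   # differs only in coefficient 0
print(mel_cepstral_distortion(ref, same))
```

Lower values mean the generated parameters are closer to those of the held out natural speech.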




Chapter 14. Labeling Speech 



In the early days of concatenative speech synthesis every recorded prompt had to be
hand labeled. Although a significant task, very skilled and mind bogglingly tedious,
it was a feasible task to attempt when databases were relatively small and the time to
build a voice was measured in years. With the increase in size of databases and the demand
for much faster turnaround we have moved away from hand labeling to automatic
labeling.

In this section we will only touch on the aspects of what we need labeled in recorded
data but discuss what techniques are available for how to label it. As discussed before,
phonemes are a useful but incomplete inventory of units that should be identified, but
other aspects of lexical stress, prosody, allophonic variations etc. are certainly worthy
of consideration.

In labeling recorded prompts for synthesis we rely heavily on the work that has been
done in the speech recognition community. For synthesis we do, however, have
different goals. In ASR (automatic speech recognition) we are trying to find the most
likely set of phones that are in a given acoustic observation. In synthesis labeling,
however, we know the sequence of phones spoken, assuming the voice talent spoke
the prompt properly, and wish to find out where those phones are in the signal. We
care, very deeply, about the boundaries of segments, while ASR can achieve adequate
performance by only concerning itself with the centers, and hence has rightly
been optimized for that.

AWB: that point deserves more discussion, though maybe not here

There are other distinctions from the ASR task. In synthesis labeling we are concerned
with a single speaker, that is, if the synthesizer is going to work well, very carefully
performed and consistently recorded. This does make things easier for the labeling
task. However in synthesis labeling we are also concerned about prosody, and
spectral variation, much more than in ASR.

We discuss two specific techniques for labeling recorded prompts here, which each have
their advantages and limitations. Procedures for running these are discussed at the end
of each section.

The first technique uses dynamic time warping alignment techniques to find the phone
boundaries in a recorded prompt by aligning it against a synthesized utterance where
the phone boundaries are known. This is computationally easier than the second technique
and works well for small databases which do not have full phonetic coverage.

The second technique uses Baum-Welch training to build complete ASR acoustic models
from the database. This takes some time, but if the database is phonetically
balanced, as should be the case in databases designed for speech synthesis voices,
it can work well. Also this technique can work well on databases in languages that do
not yet have a synthesizer, which would make the dynamic time warping technique hard
without cross-language phone mapping techniques.

Labeling with Dynamic Time Warping 

DTW (dynamic time warping) is a technique for aligning some new recording with
some known one. This technique was used in early speech recognition systems which
had limited vocabularies as it requires an acoustic signal for each word/phrase to be
recognized. This technique is sometimes still used in matching two audio signals in
command and control situations, for example in some cell-phones for voice dialing.

What is important in DTW alignment is that it can deal with signals that have varying
durations. The idea has been around for many years, though its application to
labeling in synthesis is relatively new. The work here is based on the details published
in [malfrere].






Comparing raw acoustic scores is unlikely to give good results so comparisons are
done in the spectral domain. Following ASR techniques we will use Mel Frequency
Cepstral Coefficients to represent the signal, and also following ASR we will
include delta MFCCs (the difference between the current MFCC vector and the previous
MFCC vector). However for the DTW algorithm the content of the vectors is somewhat
irrelevant, and they are merely treated as vectors.
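As a small illustration of the delta features just described, the Python sketch below appends to each frame the difference between it and the previous frame. The function name is hypothetical, and giving the first frame a delta of zero is an assumption, since the text does not say how the first frame is treated.

```python
def add_deltas(frames):
    """Append delta coefficients (current minus previous frame) to each frame."""
    out = []
    prev = None
    for frame in frames:
        if prev is None:
            delta = [0.0] * len(frame)      # no previous frame: assume zero change
        else:
            delta = [c - p for c, p in zip(frame, prev)]
        out.append(list(frame) + delta)
        prev = frame
    return out


mfcc = [[1.0, 2.0], [1.5, 1.0], [1.5, 1.0]]   # made-up two-coefficient frames
for vec in add_deltas(mfcc):
    print(vec)
```

Each output vector is twice as long as its input frame, and the alignment then treats it as an ordinary vector.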

The next stage is to define a distance function between two vectors. Conventionally we
use Euclidean distance, defined as

$$ d(v_0, v_1) = \sqrt{\sum_{i=1}^{n} (v_{0i} - v_{1i})^2} $$

where n is the number of coefficients in each vector. Weights could be considered too.

The search itself is best pictured as a large matrix. The algorithm then searches for
the best path through the matrix. At each node it finds the distance between the two
current vectors and sums it with the smallest of three potential previous states. That
is one of (i-1,j), (i,j-1), or (i-1,j-1). If two signals were identical the best path would
be the diagonal through the matrix; if one part of the signal is shorter or longer than the
corresponding one, horizontal or vertical parts will have less cost.
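The matrix search can be sketched in a few lines of Python. This is a minimal illustration of the recurrence described in the text, not the code used by the labeling scripts: the function names are hypothetical, the Euclidean distance is unweighted, and preferring the diagonal when costs tie is a choice made for this sketch.

```python
import math


def euclidean(v0, v1):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v0, v1)))


def dtw(seq_a, seq_b):
    """Align two sequences of vectors; return (total cost, path).

    The path is a list of (i, j) index pairs from (0, 0) to the end.
    """
    n, m = len(seq_a), len(seq_b)
    cost = [[float("inf")] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = euclidean(seq_a[i], seq_b[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            best_cost, best_prev = min(candidates)  # ties fall to the diagonal
            cost[i][j] = d + best_cost
            back[i][j] = best_prev
    path = []
    node = (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return cost[n - 1][m - 1], path


recorded = [[0.0], [1.0], [1.0], [2.0]]   # made-up one-dimensional frames
synthesized = [[0.0], [1.0], [2.0]]
print(dtw(recorded, synthesized))
```

In the example the second synthesized frame is paired with two recorded frames, so a phone boundary known in the synthesized prompt can be carried over to the recorded prompt by following the path.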

matrix diagram (more than one)
AWB: describe the make_labs stuff and cross-language phone mapping

Labeling with Full Acoustic Models 

A second method for labeling is also available. Here we train full acoustic HMM
models on the recorded data. We build a database specific speech recognition
engine and use that engine to label the data. As this method can work from recorded
prompts plus orthography (and a method to produce phone strings from that
orthography), this works well when you have no synthesizer to bootstrap from. However
such training requires that the database has a suitable number of examples of
triphones in it. Here we have an advantage. As the requirement for speech synthesis
data, that it has a good distribution of phonemes, is the same as that required for
acoustic modeling, a good speech synthesis database should produce a good acoustic
model for labeling. Although there is no neat definition of what "good"
is, we can say that you probably need at least 400 utterances, and at least 15,000
segments. 400 sentences all starting with "The time is now ..." probably won't do.

Other large database synthesis techniques use the same basic techniques to not just
label the database but define the units to be selected. [Donovan95] and others label
their data with an acoustic model built (with Baum-Welch training) and use the
defined HMM states (typically 3-5 per phoneme) as the units for selecting. [Tokuda9?]
actually use the state models themselves to generate the units, but again use the same
basic techniques for labeling.

For training we use Carnegie Mellon University's SphinxTrain and Sphinx speech
recognition system. There are other accessible training systems out there, HTK
being the most famous, but SphinxTrain is the one we are most familiar with, and we
have some control over its updates so can better ensure it remains appropriate for
our synthesis labeling task. As voice building is complex, acoustic model building is
similarly so. SphinxTrain has been reliably used to label hundreds of databases
in many different languages but making it utterly robust against unseen data is very
hard, so although we have tried to minimize the chance of things going wrong (in
non-obvious ways), we will not be surprised if, when you try this processing on
some new database, there are some problems.

SphinxTrain (and Sphinx) have a number of restrictions which we need to keep in
mind when labeling a set of prompts. These are code limitations, and may be fixed
in future versions of SphinxTrain/Sphinx. For the most part they are not actually
serious restrictions, just minor problems that the setup scripts need to work around. The
scripts cater for these limitations, and mostly they will go unseen by the user, unless of
course something goes wrong.

Specifically, Sphinx folds case on all phoneme names, so the scripts ensure that phone
names are distinct irrespective of upper and lower case. This is done by prepending
"CAP" to upper case phone names. Secondly, there can be at most 255 phones. This is
likely to be a problem only when SphinxTrain phones are made more elaborate than
simple phones, so mostly won't be a problem. The third noted problem is limitations
in the length and complexity of utterances. The transcript file has a line length
limit, as does the lexicon. For "nice" utterances this is never a problem, but for
some of our databases, especially those with paragraph-length utterances, the training
and/or the labeling itself fails.
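
The first two restrictions can be illustrated with a small shell sketch. This is not
the code used by the FestVox setup scripts: the function names are hypothetical, and
it assumes that any phone name containing an upper case letter receives the "CAP"
prefix.

```sh
# Illustrative sketch only; sphinx_safe_phone and check_phone_count are
# hypothetical helpers, not part of FestVox or SphinxTrain.

# Prefix "CAP" to phone names containing upper case letters, so that names
# stay distinct after Sphinx folds case (assumed rule: any upper case letter).
sphinx_safe_phone() {
    case "$1" in
        *[[:upper:]]*) printf 'CAP%s\n' "$1" ;;
        *)             printf '%s\n' "$1" ;;
    esac
}

# Succeed only if the number of phone names given is within the 255 limit.
check_phone_count() {
    [ "$#" -le 255 ]
}
```

With this sketch a phoneset containing both N and n maps to CAPN and n, which remain
different after case folding.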

Sphinx2 is a real-time speech recognition system made available under a free
software license, available from http://cmusphinx.org. The source is available
from http://sourceforge.net/projects/cmusphinx/. For these tests we used version
sphinx2-0.4.tar.gz. SphinxTrain is a set of programs and scripts that allow
the building of acoustic models for Sphinx2 (and Sphinx3). You can download
SphinxTrain from http://sourceforge.net/projects/cmusphinx/ (see note 3). Note that
Sphinx2 must be compiled and installed while SphinxTrain can run in place. On many
systems steps like these should give you working versions.

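# Build and install Sphinx2 into a local runtime directory
tar zxvf sphinx2-0.4.tar.gz
mkdir sphinx2-runtime
export SPHINX2DIR=`pwd`/sphinx2-runtime
cd sphinx2
./configure --prefix=$SPHINX2DIR
make
make install
cd ..
# Build SphinxTrain, which runs in place
tar zxvf SphinxTrain-0.9.1-beta.tar.gz
cd SphinxTrain
./configure
make
# Return to the top directory so that `pwd`/SphinxTrain names the build directory
cd ..
export SPHINXTRAINDIR=`pwd`/SphinxTrain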

Now that we have Sphinx2 and SphinxTrain installed we can prepare our FestVox
voice for training. Before starting the training process you must create utterance files
for each of the prompts. This can be done with the conventional festival script.

festival -b festvox/build_clunits.scm '(build_prompts "etc/txt.done.data")'

This generates label files in prompt-lab/ and waveform files in prompt-wav/,
which technically are not needed for this labeling process. Utterances are saved
in prompt-utt/. At first it was thought that the prompt file etc/txt.done.data
would be sufficient, but the synthesis process is likely to resolve pronunciations in
context, through post-lexical rules etc., which would make naive conversion of the
words in the prompt list to phone lists wrong in general. So the transcription for
SphinxTrain is generated from the utterances themselves, which ensures that the
resulting labels can be trivially mapped back after labeling. Thus the word names
generated in this process are somewhat arbitrary, though often human readable.
The word names are the words themselves plus a number (to ensure uniqueness
in pronunciations). Only "nice" words are printed as is, i.e. those containing only
alphabetic characters; others are mapped to the word "w" with an appropriate
number following. Thus hyphenated, quoted, etc. words will not cause a problem for
the SphinxTrain code.
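
The naming rule can be sketched as a shell function. The function name is hypothetical
and the exact way the number is attached is an assumption (here it is simply appended);
the real scripts may format the names differently.

```sh
# Hypothetical helper: name for the n-th word instance in the transcription.
# Purely alphabetic words keep their spelling; anything else becomes "w".
# Assumption: the number is appended directly to the name.
sphinx_word_name() {
    word=$1
    n=$2
    case "$word" in
        ''|*[![:alpha:]]*) printf 'w%s\n' "$n" ;;          # not purely alphabetic
        *)                 printf '%s%s\n' "$word" "$n" ;; # a "nice" word
    esac
}
```

Under these assumptions `sphinx_word_name hello 3` gives hello3, while a hyphenated
word such as well-known with number 7 gives w7.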

After the prompt utterances are generated we can set up the SphinxTrain directory
st/. All processing and output files are done within that directory until the final
conversion of labels back into the voice's own phone set, which are put in lab/. Note
this process takes a long time, at least several hours and possibly several days if you
have a particularly slow machine or a particularly large database. Also this may
require around half a gigabyte of space.

The script ./bin/sphinxtrain does the work of converting the FestVox database
into a form suitable for SphinxTrain. In all there are 6 steps: setup, building files,
converting waveforms, the training itself, alignment and conversion of label files back
into FestVox format. The training stage itself consists of 11 parts and by far takes the
most time.



This script requires the environment variables SPHINXTRAINDIR and SPHINX2DIR to
be set to point to compiled versions of SphinxTrain and Sphinx2 respectively, as shown
above.
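
Since a missing variable only shows up as a failure later on, a quick check before
starting can save time. The following is a minimal sketch; check_sphinx_env is a
hypothetical helper, not part of the FestVox scripts, and it uses the upper case
variable names from the export commands given earlier.

```sh
# Hypothetical pre-flight check: both variables must be set and must name
# existing directories.
check_sphinx_env() {
    for var in SPHINXTRAINDIR SPHINX2DIR; do
        eval "dir=\${$var:-}"
        if [ -z "$dir" ]; then
            echo "$var is not set" >&2
            return 1
        fi
        if [ ! -d "$dir" ]; then
            echo "$var=$dir is not a directory" >&2
            return 1
        fi
    done
    return 0
}
```

It could be called as `check_sphinx_env && ./bin/sphinxtrain setup`.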

The first step is to set up the sub-directory st/ where the training will take place.

./bin/sphinxtrain setup

The training database name will be taken from your etc/voice.defs; if you don't
have one of those use

$FESTVOXDIR/src/general/guess_voice_defs

The next stage is to convert the database prompt list into a transcription file suitable
for SphinxTrain, and to construct a lexicon, phone file etc. All of the generated files
will be put in st/etc/. Note that because of various limitations in Sphinx2 and
SphinxTrain, the lexicon (.dic) and transcription (.transcription) will not have what
you might think are sensible values. The word names are taken from the utterances if
they consist of only upper and lower case characters. A number is added to make them
unique. Also if another word exists with the same pronunciation but a different
spelling it may be assigned a different name from what you expect. The word names in
the SphinxTrain files are only there to help debugging and are really referring to
specific instances of words in the utterance (to ensure the pronunciations are
preserved with respect to homograph disambiguation and post-lexical rules). If people
complain about these being confusing I will make all words simply "w" followed by a
number.

./bin/sphinxtrain files

The next stage is to generate the mfccs for SphinxTrain. Unfortunately these must be
in a different format from the mfccs used in FestVox. Also SphinxTrain only supports
raw (headerless) files and NIST header files, so we copy the waveform files in wav/
into the st/wav/ directory, converting them to NIST headers.

./bin/sphinxtrain feats

Now we can start the training itself. This consists of eleven stages, each of which
will be run automatically.

• Module 0 checks the basic files for training. There should be no errors at this stage.

• Module 1 builds the vector quantization parameters.

• Module 2 builds context-independent phone models. This runs Baum-Welch over
the data building context-independent HMM phone models. This runs for several
passes until convergence (somewhere between 4 and 15 passes). There may be
some errors on some files (especially long, or badly transcribed ones), but a small
number of errors here (with the identified file being "ignored") should be ok.

• Module 3 makes the untied model definition.

• Module 4 builds context dependent models.

• Module 5a builds trees for asking questions for tied-states.

• Module 5b builds trees, one for each state in each HMM. This part takes the longest
time.

• Module 6 prunes trees.

• Module 7 retrains context dependent models with tied states.

• Module 8 performs deleted interpolation.

• Module 9 converts the generated models to Sphinx2 format.

All of the above stages should be run together with the command

./bin/sphinxtrain train

Once trained we can use these models to align the labels against the recorded
prompts.

./bin/sphinxtrain align

Some utterances may fail to be labeled at this point, either because they are too
long, or their orthography does not match the acoustics. There is no simple solution
for this at present. For some you will simply not get a label file, and you can either
label the utterance by hand, exclude it from the data, or split it into smaller files.
Other times Sphinx2 will crash and you'll need to remove the utterances from the
st/etc/*.align and st/etc/*.ctl files and run the script ./bin/sphinx_lab by hand.
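
To see which utterances did not receive a label file, the prompt list can be compared
against the label directory. This is a minimal sketch under stated assumptions:
missing_labels is a hypothetical helper, prompt file lines are assumed to look like
`( utt_id "text" )`, and label files are assumed to be named utt_id.lab.

```sh
# Hypothetical helper: list utterance ids in a prompt file that have no
# corresponding .lab file in the given directory.
missing_labels() {
    promptfile=$1
    labdir=$2
    # The id is the first token after the opening parenthesis.
    awk 'NF > 0 { id = ($1 == "(") ? $2 : substr($1, 2); print id }' "$promptfile" |
    while read -r id; do
        [ -f "$labdir/$id.lab" ] || printf '%s\n' "$id"
    done
}
```

For example `missing_labels etc/txt.done.data lab` prints one id per line for each
utterance that needs attention.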

The final stage is to take the output from the alignment and convert the labels back
into their FestVox format. If everything worked to this stage, this final stage should
be uneventful.

./bin/sphinxtrain labs
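
The six steps named in this section (setup, files, feats, train, align and labs) can
be chained so that a failure stops the run early. The wrapper below is only a sketch:
run_labeling_steps is a hypothetical name, and the optional argument exists so that
another command, such as echo, can stand in for ./bin/sphinxtrain in a dry run.

```sh
# Hypothetical wrapper: run the six labeling steps in order, stopping at the
# first one that fails. The runner defaults to ./bin/sphinxtrain.
run_labeling_steps() {
    runner=${1:-./bin/sphinxtrain}
    for step in setup files feats train align labs; do
        "$runner" "$step" || {
            echo "sphinxtrain step '$step' failed" >&2
            return 1
        }
    done
}
```

Calling `run_labeling_steps echo` just prints the step names, which is a cheap way to
check the order before committing to a run that may take many hours.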

There should be a set of reasonable phone labels in prompt-lab/. These can then be
merged into the original utterances with the command

... "etc/txt.done.data")



Prosodic Labeling 

F0, Accents, Phrases etc.



Notes 



1. http://cmusphinx.org

2. http://sourceforge.net/projects/cmusphinx/

3. http://www.speech.cs.cmu.edu/SphinxTrain/






Chapter 15. Evaluation and Improvements 



This chapter discusses evaluation of speech synthesis voices and provides a detailed 
procedure to allow diagnostic testing of new voices. 



Evaluation 



Now that you have built your voice, how can you tell if it works, and how can you
find out what you need to make it better? This chapter deals with some issues of
evaluating a voice in Festival. Some of the points here also apply to testing and
improving existing voices.

The evaluation of speech synthesis is notoriously hard. Evaluation in speech
recognition was the major factor in making general speech recognition work. Rigorous
tests on well defined data made the evaluation of different techniques possible. Though
in spite of its success, the strict evaluation criteria as used in speech recognition can
cloud the ultimate goal. It is important always to remember that tests are there to
evaluate a system's performance rather than become the task itself. Just as techniques
can overtrain on data, it is possible to overtrain on the test data and/or methodology
too, thus losing the generality and purpose of the evaluation.

In speech recognition a simple (though naive) measure of phones or words correct
gives a reasonable indicator of how well a speech recognition system works. In
synthesis this is a lot harder. A word can have multiple pronunciations, so it is much
harder to automatically test a synthesizer's phoneme accuracy; besides, much of the
quality is not just in whether it is correct but whether it "sounds good". This is
effectively the crux of the matter. The only real synthesis evaluation technique is
having a human listen to the result. Humans individually are not very reliable testers
of systems, but humans in general are. However it is usually not feasible to have
testers listen to large amounts of synthetic speech and return a general goodness
score. More specific tests are required.

Although listening tests are the ultimate, because they are expensive in resources
(undergraduates are not willing to listen to bad synthesis all day for free), and the
design of listening tests is a non-trivial task, there are a number of more general tests
which can be run at less expense and can help greatly.

It is common that a new voice in Festival (or any other speech synthesis system) has
limitations, and it is wise to test what the limitations are and decide if such
limitations are acceptable or not. This depends a lot on what you wish to use your
voice for. For example if the voice is a Scottish English voice to be primarily used as
the output of a Chinese speech translation system, the vocabulary is constrained by the
translation system itself so a large lexicon is probably not much of an issue, but the
vocabulary will include many anglicized (caledonianized?) versions of Chinese names,
which are not common in standard English, so letter-to-sound rules should be made more
sensitive for that input. If the system is to be used to read address lists, it should
be able to tokenize names and addresses appropriately, and if it is to be used in a
dialogue system the intonation model should be able to deal with questions and
continuations properly. Optimizing your voices for the most common task, and
minimizing the errors, is what evaluation is for.



Does it work at all? 



It is very easy to build a voice and get it to say a few phrases and think that the job
is done. As you build the voice it is worth testing each part as you build it to
ensure it basically performs as expected. But once it's all together more general tests
are needed. Before you submit it to any formal tests that you will use for benchmarking
and grading progress in the voice, more basic tests should be carried out.

In fact it is worth stating such initial tests more concretely. Every voice we have
ever built has always had a number of mistakes in it that can be trivially fixed, such
as the mfccs not being regenerated after fixing the pitchmarks. Therefore you should go
through each stage of the build procedure and ensure it really did do what you thought
it should do, especially if you are totally convinced that section worked perfectly.

Try to find around 100-500 sentences to play through it. It is amazing how many
general problems are thrown up when you extend your test set. The next stage is
to play some real text. That may be news text from the web, output from your speech
translation system, or some email. Initially it is worth just synthesizing the whole set
without even listening to it. Problems in analysis and missing diphones etc. may be
shown up just in the processing of the text. Then you want to listen to the output and
identify problems. This may take some amount of investigation. What you want to
do is identify where the problem is: is it bad text analysis, a bad lexical entry, a
prosody problem, or a waveform synthesis problem? You may need to synthesize parts of
the text in isolation (e.g. using the Festival function SayText) and look at the
structure of the utterance generated, e.g. using the function utt.features. For example
to see what words have been identified from the text analysis

(utt.features utt1 'Word '(name))

Or to see the phones generated

(utt.features utt1 'Segment '(name))

Thus you can view selected parts of an utterance and find out if it is being created
as you intended. For some things a graphical display of the utterance may help.

Once you identify where the problem is you need to decide how to fix it (or if it is
worth fixing). The problem may be in a number of different places:

• Phonetic error: the acoustics of a unit don't match the label. This may be because
the speaker said the wrong word/phoneme or the labeller had it wrong. Or possibly
it is some other acoustic variant that has not been considered.

• Lexical error: the word is pronounced with the wrong string of
phonemes/stress/tone. Either the lexical entry is wrong or the letter-to-sound
rules are not doing the right thing. Or there are multiple valid pronunciations for
that word (homographs) and the wrong one is selected because the homograph
disambiguation is wrong, or there is no disambiguator.

• Text error: the text analysis doesn't deal properly with the word. It may be that
a punctuation symbol is spoken (or not spoken) contrary to what is expected, or that
titles, symbols, compounds etc. aren't dealt with properly.

• Some other error: some error that is not one of the above. As you progress in
correction and tuning, errors in this category will grow and you must find some way
to avoid such errors.

Before rushing out and getting one hundred people to listen to your new synthetic
voice, it is worth doing significant internal testing and evaluation, informally, to
find errors and fix them. Remember the purpose of evaluation in this case is to find
errors and fix them. We are not, at least not at this stage, evaluating the voices on an
abstract scale, where unseen test data and blind testing are important.

Formal Evaluation Tests

Once you yourself and your immediate colleagues have tested the voice you will want
more formal evaluation metrics. Again we are looking at diagnostic evaluation;
comparative evaluation between different commercial synthesizers is quite a different
task.






In our English checks we used Wall Street Journal and Time magazine articles
(around 10 million words in total). Many unusual words appear only in one article
(e.g. proper names), which are less important to add to the lexicon, but unusual
words that appear across articles are more likely to appear again so should be
added.

Be aware that using data will cause your coverage to be biased towards that type
of data. Our databases were mostly collected in the early 90s and hence have good
coverage for the Gulf War and the changes in Eastern Europe, but our ten million
words have no occurrences of the words "Sojourner" or "Lewinski", which only appear
in stories later in the decade.



A script is provided in src/general/find_unknowns which will analyze given text
to find which words do not appear in the current lexicon. You should use the -eval
option to specify the selection of your voice. Note this checks to see which words
are not in the lexicon itself: it replaces whatever letter-to-sound/unknown word
function you specified and saves any words for which that function is called in the
given output file. For example

find_unknowns -eval '(voice_ked_diphone)' -output cmudict.unknown \
    wsj/wsj-raw/00/*

Normally you would run this over your database, then accumulate the unknown
words, then rerun the unknown words, synthesizing each and listening to them to
evaluate if your LTS system produces reasonable results. For those words which do
not have acceptable pronunciations, add them to your lexicon.
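
When unknown words are collected separately for each article, the words worth adding
first are those that turn up in more than one of them. The sketch below illustrates
that selection. It is not part of FestVox: words_in_many_files is a hypothetical
helper, and it assumes each input file holds one unknown word per line, which may not
match the actual output format of find_unknowns.

```sh
# Hypothetical helper: print the words that occur in at least MIN of the
# given files. Assumes one word per line in each file.
words_in_many_files() {
    min=$1
    shift
    for f in "$@"; do
        LC_ALL=C sort -u "$f"      # count each word once per file
    done | LC_ALL=C sort | uniq -c |
    awk -v min="$min" '$1 >= min { print $2 }'
}
```

For example `words_in_many_files 2 unknown/*.txt` would list the words seen in two or
more files, where the unknown/ directory is again only an example name.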

Semantically unpredictable Sentences

One technique that has been used to evaluate speech synthesis quality is testing
against semantically unpredictable sentences.

Discussion to be added.



Debugging voices 






Chapter 16. Markup 

SABLE, JSML, VoiceXML, Docbook, tts_modes 







Chapter 17. Concept-to-speech 



Dialog systems, speech-to-speech translation, getting more than just text.







Chapter 18. Deployment 



client/server access, footprint, scaling up, backoff strategies, signal compression.







Chapter 19. A Japanese Diphone Voice 



In this chapter we work through a full example of creating a voice given that most
of the basic construction work (model building) has been done. In particular this
discusses the scheme files, conventions for keeping a voice together, and how you
can go about packaging it for general use.

Ultimately a voice in Festival will consist of a diphone database, a lexicon (and LTS
rules) and a number of scheme files that offer the complete voice. When people other
than the developer of a voice wish to use your newly developed voice it is only that
small set of files that are required and need to be distributed (freely or otherwise).
By convention we have distributed diphone group files, a single file holding the
index and diphone data itself, and a set of scheme files that describe the voice (and
its necessary models).

Basic skeleton files are included in the festvox distribution. If you are unsure how
to go about building the basic files it is recommended you follow this schema and
modify these to your particular needs.

By convention a voice name consists of an institution name (like cmu, cstr, etc.); if
you don't have an institution just use net. Second you need to identify the language;
there is an ISO two letter standard for it, but it fails to distinguish dialects (such
as US and UK English) so it need not be strictly followed. However a short identifier
for the language is probably preferred. Third you identify the speaker; we have
typically used three letter initials which are the initials of the person speaking, but
any name is reasonable. If you are going to build a US or UK English voice you should
look at Chapter 20.
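
As a small illustration of the naming convention, the three parts can be joined as
below. The helper name is hypothetical; the _diphone suffix follows the directory and
voice name cmu_ja_awb_diphone used later in this chapter.

```sh
# Hypothetical helper: compose a diphone voice name from its three parts.
diphone_voice_name() {
    inst=${1:-net}    # institution; "net" if you have none
    lang=$2           # short language identifier, e.g. ja
    spkr=$3           # speaker identifier, e.g. awb
    printf '%s_%s_%s_diphone\n' "$inst" "$lang" "$spkr"
}
```

So `diphone_voice_name cmu ja awb` gives cmu_ja_awb_diphone, and an empty institution
falls back to net.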

The basic processes you will need to address are

construct basic template files
generate phoneset definition
generate diphone schema file
generate prompts
record speaker
label nonsense words
extract pitchmarks and LPC coefficients
test phone synthesis
add lexicon/LTS support
add tokenization
add prosody (phrasing, durations and intonation)
test and evaluate voice
package for distribution

As with all parts of festvox, you must set the following environment variables to
where you have installed versions of the Edinburgh Speech Tools and the festvox
distribution

export ESTDIR=/home/awb/projects/1.4.1/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox



In this example we will build a Japanese voice based on awb (a gaijin). First create a 
directory to hold the voice. 






mkdir ~/data/cmu_ja_awb_diphone 
cd ~/data/cmu_ja_awb_diphone 

You will need in the region of 500M of space to build a voice. Actually for Japanese
it's probably considerably less, but you must be aware that voice building does
require disk space.

Construct the basic directory structure and skeleton files with the command

$FESTVOXDIR/src/diphones/setup_diphone cmu ja awb

The three arguments are institution, language and speaker name.

The next stage is to define the phoneset in festvox/cmu_ja_phones.scm. In many
cases the phoneset for a language has been defined, and it is wise to follow
convention when it exists. Note that the default phonetic features in the skeleton file
may need to be modified for other languages. For Japanese, there are standards
and here we use a set similar to the ATR phoneset used by many in Japan for
speech processing. This file is included, but not automatically installed, in
$FESTVOXDIR/src/vox_diphone/japanese

Now you must write the code that generates the diphone schema file. You can look
at the examples in festvox/src/diphones/*_schema.scm. This stage is actually
the first difficult part; getting this right can be tricky. Finding all possible
phone-phone pairs in a language isn't as easy as it seems (especially as many possible
ones don't actually exist). The file festvox/ja_schema.scm is created providing the
function diphone-gen-list which returns a list of nonsense words, each consisting of a
list of diphones and a list of phones in the nonsense word. For example

festival> (diphone-gen-list)
((("k-a" "a-k") (pau t a k a k a pau))
 (("g-a" "a-g") (pau t a g a g a pau))
 (("h-a" "a-h") (pau t a h a h a pau))
 (("p-a" "a-p") (pau t a p a p a pau))
 (("b-a" "a-b") (pau t a b a b a pau))
 (("m-a" "a-m") (pau t a m a m a pau))
 (("n-a" "a-n") (pau t a n a n a pau))
 ...)

In addition to generating the diphone schema, ja_schema.scm should also
provide the functions Diphone_Prompt_Setup, which is called before generating the
prompts, and Diphone_Prompt_Word, which is called before waveform synthesis of
each nonsense word.

Diphone_Prompt_Setup should be used to select a speaker to generate the prompts.
Note even though you may not use the prompts when recording they are necessary
for labeling the spoken speech, so you still need to generate them. If you have a
synthesizer already in the language use it to generate the prompts (assuming you can
get it to generate from phone lists and also generate label files). Often the MBROLA
project already has a waveform synthesizer for the language so you can use that. In
this case we are going to use a US English voice (kal_diphone) to generate the prompts.
For Japanese that's probably ok as the Japanese phoneset is (mostly) a subset of the
English phoneset, though using the generated prompts to prompt the user is probably
not a good idea.

The second function, Diphone_Prompt_Word, is used to map the Japanese phone set to
the US English phone set so that waveform synthesis will work. In this case a simple
map of each Japanese phone to one or more English phones is given and the code simply
changes the phone name in the segment relation (and adds a new segment in
the multi-phone case).

Now we can generate the diphone schema list. 



festival -b festvox/diphlist.scm festvox/ja_schema.scm \
    '(diphone-gen-schema "ja" "etc/jadiph.list")'

It is worth checking etc/jadiph.list by hand so that you are sure it contains all the
diphones you wish to use.

The diphone schema file, in this case etc/jadiph.list, is a fundamentally key file
for almost all the following scripts. Even if you generate the diphone list by some
method other than described above, you should generate a schema list in exactly this
format so that everything else will work; modifying the other scripts for some other
format is almost certainly a waste of your time.

The schema file has the following format

( ja_0001 "pau t a k a k a pau" ("k-a" "a-k") )
( ja_0002 "pau t a g a g a pau" ("g-a" "a-g") )
( ja_0003 "pau t a h a h a pau" ("h-a" "a-h") )
( ja_0004 "pau t a p a p a pau" ("p-a" "a-p") )
( ja_0005 "pau t a b a b a pau" ("b-a" "a-b") )
( ja_0006 "pau t a m a m a pau" ("m-a" "a-m") )
( ja_0007 "pau t a n a n a pau" ("n-a" "a-n") )
( ja_0008 "pau t a r a r a pau" ("r-a" "a-r") )
( ja_0009 "pau t a t a t a pau" ("t-a" "a-t") )

In this case it has 297 nonsense words. 
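
Checking the schema file by hand is easier with a couple of small shell helpers. These
are only a sketch and are not part of FestVox: the function names are hypothetical, and
they assume one entry per line in the format shown above, with the diphone names as the
quoted strings following the phone string.

```sh
# Hypothetical helpers for inspecting a diphone schema file.

# Print every diphone name, one per line. Splitting each line on double
# quotes leaves the phone string in field 2 and the diphone names in
# fields 4, 6, ...
schema_diphones() {
    awk -F'"' '{ for (i = 4; i <= NF; i += 2) print $i }' "$1"
}

# Print the diphone names that are listed more than once.
schema_duplicate_diphones() {
    schema_diphones "$1" | LC_ALL=C sort | uniq -d
}
```

For example `schema_diphones etc/jadiph.list | wc -l` counts the diphones covered, and
`schema_duplicate_diphones etc/jadiph.list` shows any that are recorded twice.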

Next we can synthesize the prompts and generate their label files with the following
command

festival -b festvox/diphlist.scm festvox/ja_schema.scm \
    '(diphone-gen-waves "prompt-wav" "prompt-lab" "etc/jadiph.list")'

Occasionally when you are building the prompts some diphones requested in the
prompt voice don't actually exist (especially when you are doing cross-language
prompting). Thus the generated prompt has some default diphone (typically
silence-silence) added. This is mostly ok, as long as it's not happening multiple times
in the same nonsense word. The speaker just should be aware that some prompts aren't
actually correct (which of course is going to be true for all prompts in the
cross-language prompting case).

The next stage is to record the prompts. See the Section called Recording under Unix in
Chapter 4 for details on how to do this under Unix (and in fact other techniques too).
This can be done with the command

bin/prompt_them etc/jadiph.list

Depending on whether you want the prompts actually to be played or not, you can
edit bin/prompt_them to comment out the playing of the prompts.

Note a second argument can be given to state which nonsense word to begin prompting
from. Thus if you have already recorded the first 100 you can continue with

bin/prompt_them etc/jadiph.list 101

The recorded prompts can then be labeled by

bin/make_labs prompt-wav/*.wav

And the diphone index may be built by

bin/make_diph_index etc/awbdiph.list dic/awbdiph.est



If no EGG signal has been collected you can extract the pitchmarks by

bin/make_pm_wave wav/*.wav

If you do have an EGG signal then use the following instead

bin/make_pm lar/*.lar

A program to move the predicted pitchmarks to the nearest peak in the waveform
is also provided. This is almost always a good idea, even for EGG extracted
pitchmarks

bin/make_pm_fix pm/*.pm

Getting good pitchmarks is important to the quality of the synthesis; see
the Section called Extracting pitchmarks from waveforms in Chapter 4 for more
discussion.

Because there is often a power mismatch through a set of diphones we provide a
simple method for finding what general power differences exist between files. This
finds the mean power for each vowel in each file and calculates a factor with respect
to the overall mean vowel power. A table of power modifiers for each file can be
calculated by

bin/find_powerfactors lab/*.lab

The factors calculated by this are saved in etc/powfacts.

Then build the pitch-synchronous LPC coefficients, which use the power factors if
they've been calculated.

bin/make_lpc wav/*.wav

This should get you to the stage where you can test the basic waveform synthesizer.
There is still much to do but initial tests (and correction of labeling errors etc.) can
start now. Start festival as

festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"

and then enter a string of phones

festival> (SayPhones '(pau k o N n i ch i w a pau))

In addition to the waveform generation part you must also provide text analysis for
your language. Here, for the sake of simplicity, we assume that the Japanese is
provided in romanized form with spaces between each word. This is of course not the
case for normal Japanese (and we are working on a proper Japanese front end). But at
present this shows the general idea. Thus we edit festvox/cmu_ja_token.scm and
add (simple) support for numbers.

As the relationship between romaji (romanized Japanese) and phones is almost trivial
we write a set of letter-to-sound rules by hand that expand words into their phones.
This is added to festvox/cmu_ja_lex.scm.



For the time being we just use the default intonation model, though simple rule-driven
improvements are possible. See festvox/cmu_ja_awb_int.scm. For duration, we
add a mean value for each phone in the phoneset to festvox/cmu_ja_awb_dur.scm.

These three Japanese specific files are included in the distribution in
festvox/src/vox_diphone/japanese/.

Now we have a basic synthesizer. Although there is much to do, we can now type
(romanized) text to it.

festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"
festival> (SayText "boku wa gaijin da yo.")

The next part is to test and improve these various initial subsystems: lexicons, text
analysis and prosody, and to correct waveform synthesis problems. This is an endless
task but you should spend significantly more time on it than we have done for this
example.

Once you are happy with the completed voice you can package it for distribution.
The first stage is to generate a group file for the diphone database. This extracts the
subparts of the nonsense words and puts them into a single file offering something
smaller and quicker to access. The group file can be built as follows.

festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"
festival> (us_make_group_file "group/awblpc.group" nil)

The us_ in the function names stands for UniSyn (the unit concatenation subsystem
in Festival) and has nothing to do with US English.

To test this edit festvox/cmu_ja_awb_diphone.scm and change the choice of
databases used from separate to grouped. This is done by commenting out the line
(around line 81)

(set! cmu_ja_awb_db_name (us_diphone_init cmu_ja_awb_lpc_sep))

and uncommenting the line (around line 84)

(set! cmu_ja_awb_db_name (us_diphone_init cmu_ja_awb_lpc_group))

The next stage is to integrate this new voice so that festival may find it
automatically. To do this you should add a symbolic link from the directory of
Festival's Japanese voices to the directory containing the new voice. First cd to
festival's voice directory (this will vary depending on where your version of festival
is installed)

cd /home/awb/projects/1.4.1/festival/lib/voices/japanese/

creating the language directory if it does not already exist. Add a symbolic link
back to where your voice was built

ln -s /home/awb/data/cmu_ja_awb_diphone

Now this new voice will be available for anyone running that version of festival,
started from any directory, without the need for any explicit arguments

festival
festival> (voice_cmu_ja_awb_diphone)
festival> (SayText "ohayo gozaimasu.")



The final stage is to generate a distribution file so the voice may be installed on
others' festival installations. Before you do this you must add a file COPYING to the
directory you built the diphone database in. This should state the terms and conditions
under which people may use, distribute and modify the voice.

Generate the distribution tarfile in the directory above the festival installation (the
one where the festival/ and speech_tools/ directories are).

cd /home/awb/projects/1.4.1/

tar zcvf festvox_cmu_ja_awb_lpc.tar.gz \
festival/lib/voices/japanese/cmu_ja_awb_diphone/festvox/*.scm \
festival/lib/voices/japanese/cmu_ja_awb_diphone/COPYING \
festival/lib/voices/japanese/cmu_ja_awb_diphone/group/awblpc.group

The completed files from building this crude Japanese example are available at
http://festvox.org/examples/cmu_ja_awb_diphone/.

Notes 

1. http://festvox.org/examples/cmu_ja_awb_diphone/




Chapter 20. US/UK English Diphone Synthesizer 



When building a new diphone based voice for a supported language, such as English,
the upper parts of the system can mostly be taken from existing voices, thus making
the building task simpler. Of course, things can still go wrong, and it's worth checking
everything at each stage. This section gives the basic walkthrough for building a new
US English voice. Support for building UK (southern, RP dialect) voices is also provided
this way. For building non-US/UK synthesizers see Chapter 19 for a similar walkthrough
that also covers the full text, lexicon and prosody issues which we can subsume in
this example.



Recording a whole diphone set usually takes a number of hours, if everything goes
to plan. Construction of the voice after recording may take another couple of hours,
though much of this is CPU bound. Then hand-correction may take at least another
few hours (depending on the quality). Thus if all goes well it is possible to construct a
new voice in a day's work, though usually something goes wrong and it takes longer.
The more time you spend making sure the data is correctly aligned and labeled, the
better the results will be. While something can be made quickly, it can take much
longer to do it very well.

For those of you who have ignored the rest of this document and are just hoping to
get by by reading this, good luck. It may be possible to do that, but considering the
time you'll need to invest to build a voice, being familiar with the comments, at least
in the rest of this chapter, may be well worth the time invested.

The tasks you will need to do are:

• construct basic template files
• generate prompts
• record nonsense words
• autolabel nonsense words
• generate diphone index
• generate pitchmarks and LPC coefficients
• test, and hand fix diphones
• build diphone group files and distribution



As with all parts of festvox, you must set the following environment variables to
where you have installed versions of the Edinburgh Speech Tools and the festvox
distribution

export ESTDIR=/home/awb/projects/1.4.1/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox

The next stage is to select a directory in which to build the voice. You will need on the
order of 500M of disk space to do this. It could be done in less, but it's better to have
enough to start with. Make a new directory and cd into it

mkdir ~/data/cmu_us_awb_diphone
cd ~/data/cmu_us_awb_diphone

By convention, the directory is named for the institution, the language (here, us 
English) and the speaker (awb, who actually speaks with a Scottish accent). Although 
it can be fixed later, the directory name is used when festival searches for available 
voices, so it is good to follow this convention. 

Build the basic directory structure 






$FESTVOXDIR/src/diphones/setup_diphone cmu us awb

The arguments to setup_diphone are the institution building the voice, the language,
and the name of the speaker. If you don't have an institution we recommend
you use net. There is an ISO standard for language names, though unfortunately it
doesn't allow distinction between US and UK English, so in general we recommend
you use the two letter form, though for US English use us and for UK English use uk.
The speaker name may or may not be their actual name.

The setup script builds the basic directory structure and copies in various skeleton
files. For languages us and uk it copies in files with many of the details filled in for
those languages; for other languages the skeleton files are much more skeletal.

For constructing a us voice you must have the following installed in your version of
festival

festvox_kallpc16k
festlex_POSLEX
festlex_CMU

And for a UK voice you need

festvox_rablpc16k
festlex_POSLEX
festlex_OALD

At run-time the two appropriate festlex packages (POSLEX + dialect specific lexicon)
will be required but not the existing kal/rab voices.

To generate the nonsense word list

festival -b festvox/diphlist.scm festvox/us_schema.scm \
'(diphone-gen-schema "us" "etc/usdiph.list")'

We use a synthesized voice to build waveforms of the prompts, both for
actual prompting and for alignment. If you want to change the prompt voice
(e.g. to a female) edit festvox/us_schema.scm. Near the end of the file is
the function Diphone_Prompt_Setup. By default (for US English) the voice
(voice_kal_diphone) is called. Change that, and the F0 value in the following line,
if appropriate, to the voice you wish to follow.

Then to synthesize the prompts

festival -b festvox/diphlist.scm festvox/us_schema.scm \
'(diphone-gen-waves "prompt-wav" "prompt-lab" "etc/usdiph.list")'

Now record the prompts. Care should be taken to set up the recording environment
as well as possible. Note all power levels so that if more than one session is required
you can continue and still get the same recording quality. Given the length of the US
English list, it's unlikely a person can say all of these in one sitting without at least
taking breaks, so ensuring the environment can be duplicated is important, even if it's
only after a good stretch and a drink of water.

bin/prompt_them etc/usdiph.list

Note that a further argument can be given to state which nonsense word to begin
prompting from. Thus if you have already recorded the first 100 you can continue with

bin/prompt_them etc/usdiph.list 101






See the Section called US phoneset in Chapter 30 for notes on pronunciation (or 
the Section called UK phoneset in Chapter 30 for the UK version). 

The recorded prompts can then be labeled by

bin/make_labs prompt-wav/*.wav

It is always worthwhile correcting the autolabeling. Use

emulabel etc/emu_lab

and select open from the top menu bar, then place the other dialog box,
click inside it and hit return. A list of all label files will be given. Double-click on each
of these to see the labels, spectrogram and waveform. (** reference to "How to correct
labels" required **).

Once the diphone labels have been corrected, the diphone index may be built by

bin/make_diph_index etc/usdiph.list dic/awbdiph.est

If no EGG signal has been collected you can extract the pitchmarks by (though read
the Section called Extracting pitchmarks from waveforms in Chapter 4 to ensure you are
getting the best extraction)

bin/make_pm_wave wav/*.wav

If you do have an EGG signal then use the following instead

bin/make_pm lar/*.lar

A program to move the predicted pitchmarks to the nearest peak in the waveform
is also provided. This is almost always a good idea, even for EGG extracted
pitchmarks

bin/make_pm_fix pm/*.pm

Getting good pitchmarks is important to the quality of the synthesis; see
the Section called Extracting pitchmarks from waveforms in Chapter 4 for more
discussion.

Because there is often a power mismatch through a set of diphones we provide a
simple method for finding what general power differences exist between files. This
finds the mean power for each vowel in each file and calculates a factor with respect
to the overall mean vowel power. A table of power modifiers for each file can be
calculated by

bin/find_powerfactors lab/*.lab

The factors calculated by this are saved in etc/powfacts.
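
The calculation described above can be sketched as follows. This is only an illustrative Python model and not the code in bin/find_powerfactors: the function name power_factors, the input layout (file name mapped to a list of per-vowel mean power values) and the direction of the ratio (overall mean divided by file mean) are all assumptions made for the example.

```python
from statistics import mean

def power_factors(vowel_power_by_file):
    """Return a per-file factor relating its mean vowel power to the overall mean.

    vowel_power_by_file is a hypothetical input of the form
    {file name: [mean power of each vowel found in that file]}.
    """
    # Mean vowel power within each file (files without vowels are skipped).
    file_means = {name: mean(powers)
                  for name, powers in vowel_power_by_file.items() if powers}
    # Overall mean vowel power across every vowel in every file.
    overall = mean(p for powers in vowel_power_by_file.values() for p in powers)
    # Assumed convention: a factor above 1 means the file is quieter than average.
    return {name: overall / file_mean for name, file_mean in file_means.items()}

factors = power_factors({
    "us_0001": [100.0, 120.0],
    "us_0002": [200.0, 240.0],
})
for name, factor in sorted(factors.items()):
    print(name, round(factor, 3))
```

With these made-up numbers the quieter file gets a factor above 1 and the louder file a factor below 1, which is the kind of per-file modifier the text describes.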

Then build the pitch-synchronous LPC coefficients, which use the power factors if
they've been calculated.

bin/make_lpc wav/*.wav 



Now the database is ready for its initial tests. 






festival festvox/cmu_us_awb_diphone.scm '(voice_cmu_us_awb_diphone)' 

When there has been no hand correction of the labels this stage may fail with diphones
not having proper start, mid and end values. This happens when the automatic
labeler has positioned two labels at the same point. For each diphone that has
a problem find out which file it comes from (grep for it in dic/awbdiph.est) and
use emulabel to correct the labeling. For example, suppose "ah-m" is
wrong; you'll find it comes from us_0314. Thus type

emulabel etc/emu_lab us_0314



After correcting labels you must re-run the make_diph_index command. You
should also re-run the find_powerfactors and make_lpc stages as these too
depend on the labels, but these take longer to run and perhaps need only be
done when you've corrected many labels.

To test the voice's basic functionality use

festival> (SayPhones '(pau hh ax l ow pau))
festival> (intro)

As the autolabeling is unlikely to work completely you should listen to a number of
examples to find out which diphones have gone wrong.

Finally, once you have corrected the errors (did we mention you need to check and
correct the errors?), you can build a final voice suitable for distribution. First you
need to create a group file which contains only the subparts of spoken words which
contain the diphones.

festival festvox/cmu_us_awb_diphone.scm '(voice_cmu_us_awb_diphone)'
festival> (us_make_group_file "group/awblpc.group" nil)



The us_ in this function name confusingly stands for UniSyn (the unit concatenation
subsystem in Festival) and has nothing to do with US English.

To test this, edit festvox/cmu_us_awb_diphone.scm and change the choice of
database used from separate to grouped. This is done by commenting out the line
(around line 81)

(set! cmu_us_awb_db_name (us_diphone_init cmu_us_awb_lpc_sep))

and uncommenting the line (around line 84)

(set! cmu_us_awb_db_name (us_diphone_init cmu_us_awb_lpc_group))

The next stage is to integrate this new voice so that festival can find it automatically.
To do this, you should add a symbolic link from the voice directory of Festival's
English voices to the directory containing the new voice. First cd to festival's voice
directory (this will vary depending on where you installed festival)

cd /home/awb/projects/1.4.1/festival/lib/voices/english/

and add a symbolic link back to where your voice was built

ln -s /home/awb/data/cmu_us_awb_diphone

Now this new voice will be available to anyone running that version of festival (started
from any directory)




festival 

festival> (voice_cmu_us_awb_diphone) 
festival> (intro) 

The final stage is to generate a distribution file so the voice may be installed on
others' festival installations. Before you do this you must add a file COPYING to the
directory you built the diphone database in. This should state the terms and conditions
under which people may use, distribute and modify the voice.

Generate the distribution tarfile in the directory above the festival installation (the
one where the festival/ and speech_tools/ directories are).

cd /home/awb/projects/1.4.1/

tar zcvf festvox_cmu_us_awb_lpc.tar.gz \
festival/lib/voices/english/cmu_us_awb_diphone/festvox/*.scm \
festival/lib/voices/english/cmu_us_awb_diphone/COPYING \
festival/lib/voices/english/cmu_us_awb_diphone/group/awblpc.group

The complete files from building an example US voice based on the KAL recordings
are available at http://festvox.org/examples/cmu_us_kal_diphone/.



Notes 



1. http://festvox.org/examples/cmu_us_kal_diphone/








Chapter 21. ldom full example

Domain analysis, prompt design, build walkthrough and debugging; weather or
stocks or suchlike.







Chapter 22. Non-English ldom example

Cross linguistic build with minimal local language support. 







Chapter 23. Concluding remarks and future 



Where will it all lead to. 







Chapter 24. Festival Details 

This chapter offers descriptions of various Festival internals including APIs. 







Chapter 25. Festival's Scheme Programming Language 



This chapter acts as a reference guide for the particular dialect of the Scheme
programming language used in the Festival Speech Synthesis System. The Scheme
programming language is a dialect of Lisp designed to be more consistent. It was chosen
for the basic scripting language in Festival because:

• it is a very easy language for machines to parse and interpret, thus the footprint
for the interpreter proper is very small

• it offers garbage collection, making managing objects safe and easy

• it offers a general consistent data structure for representing parameters, rules etc.

• it was familiar to the authors

• it is suitable as an embedded system

Having a scripting language in Festival is actually one of the fundamental properties
that makes Festival a useful system. The fact that new voices and languages in many
cases can be added without changing the underlying C++ code makes the system
much more powerful and accessible than a more monolithic system that requires
recompilation for any parameter changes. As there is sometimes confusion we should
make it clear that Festival contains its own Scheme interpreter as part of the system.
Festival can be viewed as a Scheme interpreter that has had basic additions to its
functions to include modules that can do speech synthesis; no external Scheme
interpreter is required to use Festival.

The actual interpreter used in Festival is based on George Carrette's SIOD, "Scheme
in one Defun". But this has been substantially enhanced from its small elegant
beginnings into something that might be better called "Scheme in one directory". Although
there is a standard for Scheme the version in Festival does not fully follow it, for both
good and bad reasons. Thus in order for people to be able to program in Festival's
Scheme we provide this chapter to list the core types, functions, etc. and some
examples. We do not pretend to be teaching programming here but as we know many
people who are interested in building voices are not primarily programmers, some
guidance on the language and its usage will make the simple programming that is
required in building voices more accessible.

For reference the Scheme Revised Revised Revised report describes the standard
definition [srrrr90]. For a good introduction to programming in general that happens to
use Scheme as its example language we recommend [abelson85]. Also for those who
are unfamiliar with the use of Lisp-like scripting languages we recommend a close
look at GNU Emacs, which uses Lisp as its underlying scripting language; knowledge
of the internals of Emacs did to some extent influence the scripting language design
of Festival.

Overview

"Lots of brackets" is what comes to most people's minds when considering Lisp and
its various derivatives such as Scheme. At the start this can seem daunting and it is
true that parenthesis errors can cause problems. But with an editor that does proper
bracket matching, brackets can actually be helpful in code structure rather than a
hindrance.

The fundamental structure is the s-expression. It consists of an atom, or a list of
s-expressions. This simply defined recursive structure allows complex structures to
easily be specified. For example

3
(1 2 3)
(a (b c) d)
((a b) (d e))

Unlike most other programming languages, Scheme's data and code are in the same
format, s-expressions. Thus s-expressions are evaluated, recursively.

Symbols:

are treated as variables and when evaluated return their currently set value.

Strings and numbers:

evaluate to themselves.

Lists:

Each member of the list is evaluated and the first item in the list is treated as a
function and applied using the remainder of the list as arguments to the function.

Thus the s-expression

(+ 1 2)

when evaluated will return 3 as the symbol + is bound to a function that adds its
arguments.

Variables may be set using the set! function which takes a variable name and a value
as arguments

(set! a 3)

The set! function is unusual in that it does not evaluate its first argument. If it did
you would have to explicitly quote it or set some other variable to have a value of a to
get the desired effect.
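
The evaluation rules above, including the special treatment of set!, can be modelled in a few lines. The following Python sketch is not how Festival's interpreter is implemented: the Symbol class, the evaluate function and the use of nested Python lists for s-expressions are all assumptions chosen for illustration, and returning the assigned value from set! is likewise assumed.

```python
import operator

class Symbol(str):
    """Hypothetical stand-in for a Scheme symbol; a plain str models a Scheme string."""

def evaluate(sexp, env):
    """Evaluate a nested Python list as if it were an s-expression."""
    if isinstance(sexp, Symbol):
        # Symbols are variables: return the currently set value.
        return env[sexp]
    if isinstance(sexp, list):
        head = sexp[0]
        if isinstance(head, Symbol) and head == "set!":
            # set! does not evaluate its first argument, only the value.
            _, name, value_expr = sexp
            env[name] = evaluate(value_expr, env)
            return env[name]
        # Ordinary lists: evaluate every member, then apply the first to the rest.
        func, *args = [evaluate(item, env) for item in sexp]
        return func(*args)
    # Strings and numbers evaluate to themselves.
    return sexp

env = {Symbol("+"): operator.add}

print(evaluate([Symbol("+"), 1, 2], env))                  # models (+ 1 2)
evaluate([Symbol("set!"), Symbol("a"), 3], env)            # models (set! a 3)
print(evaluate([Symbol("+"), Symbol("a"), [Symbol("+"), 1, 2]], env))
```

The sketch covers only the three rules stated in the text plus set!; the empty list, quoting and other special forms are deliberately left out.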

quoting, define

Data Types

There are a number of basic data types in this Scheme; new ones may also be added but
only through C++ functions. The basic types are

Symbols:

Symbols are atoms starting with an alphabetic character. Unlike numbers and
strings, they may be used as variables. Examples are

a bcd f6 myfunc
plus cond

Symbols may be created from strings by using the function intern.

Numbers:

In this version of Scheme all numbers are doubles; there is no distinction between
floats, doubles and ints. Examples are

1.4
3.14
345
3456756.4345476

Numbers evaluate to themselves, that is the value of the atom 2 is the number
2.






Strings: 

Strings are bounded by the double quote character ". For example

"a"
"abc"
"This is a string"

Strings evaluate to themselves. They may be converted to symbols with the
function intern. If they are strings of characters that represent numbers you
can convert a string to a number with the function parse-number. For example

(intern "abc") => abc
(parse-number "3.14") => 3.14

Although you can make symbols from numbers you should not do that.

Double quotes may be specified within a string by escaping them with a backslash.
Backslashes therefore also require an escape backslash. That is, "ab\"c" contains
four characters, a, b, " and c. "ab\\c" also contains four characters, a, b, \ and
c. And "ab\\\"c" contains five characters a, b, \, " and c.
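
The character counts above can be checked in any language that shares these backslash conventions. Python happens to read these three particular literals the same way, so the short sketch below (the variable names are our own) can be used to confirm them; it says nothing about other escape sequences, which may differ between Python and Festival's Scheme.

```python
quote_inside = "ab\"c"         # a, b, " and c
backslash_inside = "ab\\c"     # a, b, \ and c
both_inside = "ab\\\"c"        # a, b, \, " and c

for text in (quote_inside, backslash_inside, both_inside):
    # Print the number of characters and the characters themselves.
    print(len(text), list(text))
```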

Lists or Cons

Lists start with a left parenthesis and end with a right parenthesis with zero or
more s-expressions between them. For example

(a b c)
()
(b (b d) e)
((the boy) saw (the girl (in (the park))))

Lists can be made by various functions, most notably cons and list. cons
returns a list whose first item, standardly called its car, is the first argument of cons,
and whose remainder, standardly called its cdr, is the second argument of cons.

(cons 'a '(b c)) => (a b c)
(cons '(a b) '(c d)) => ((a b) c d)

Functions: 

Functions may be applied explicitly by the function apply or, more normally,
when they appear as the first item in a list to be evaluated. The normal way to
define a function is using the define function. For example

(define (ftoc temp)
  (/ (* (- temp 32) 5) 9))

This binds the function to the variable ftoc. Functions can also be defined
anonymously, which is sometimes convenient.

(lambda (temp)
  (/ (* (- temp 32) 5) 9))

returns a function.

Others:

Other internal types are supported by Festival's Scheme including some important
object types used for synthesis such as utterances, waveforms, items etc. They are
normally printed in the form

#<Utterance 6234>
#<Wave 1294>

The print form is a convenience form only. Entering that string of characters will
not allow a reference to that object. The number is unique to that object instance
(it is actually the internal address of the object), and can be used visually to note
if objects are the same or not.



Functions 



This section lists the basic functions in Festival's Scheme. It doesn't list them all (see
the Festival manual for that) but does highlight the key functions that you should
normally use.

Core functions

These functions are the basic functions used in Scheme. These include the structural
functions for setting variables, conditionals, loops, etc.

(set! SYMBOL VALUE)

Sets SYMBOL to VALUE. SYMBOL is not evaluated, while VALUE is. Example

(set! a 3)
(set! pi 3.14)
(set! fruit '(apples pears bananas))
(set! fruit2 fruit)

(define (FUNCNAME ARG0 ARG1 ...) . BODY)

Define a function called FUNCNAME with specified arguments and body.

(define (myadd a b) (+ a b))
(define (factorial a)
  (cond
    ((< a 2) 1)
    (t (* a (factorial (- a 1))))))



(if TEST TRUECASE [FALSECASE])

If the value of TEST is non-nil, evaluate TRUECASE and return its value; otherwise,
if FALSECASE is present, evaluate it and return its value, else return nil.

(if (string-equal v "apples")
    (format t "It's an apple\n")
    (format t "It's not an apple\n"))

(if (member v '(apples pears bananas))
    (begin
      (format t "It's a fruit (%s)\n" v)
      'fruit)
    'notfruit)

(cond (TEST0 . BODY) (TEST1 . BODY) ...)

A multiple if statement. Evaluates each test until a non-nil test is found, then
evaluates each of the expressions in that body and returns the value of the last one.

(cond
  ((string-equal v "apple")
   'ringo)
  ((string-equal v "plum")
   'ume)
  ((string-equal v "peach")
   'momo)
  (t
   'kudamono))



(begin . BODY)

This evaluates each s-expression in BODY and returns the value of the last
s-expression in the list. This is useful for cases where only one s-expression is
expected but you need to call a number of functions, notably in the if function.

(if (string-equal v "pear")
    (begin
      (format t "assuming it's an asian pear\n")
      'nashi)
    'kudamono)

(or . DISJ)

Evaluate each disjunct until one is non-nil and return that value.

(or (string-equal v "tortoise")
    (string-equal v "turtle"))
(or (string-equal v "pear")
    (string-equal v "apple")
    (< num_fruits 6))

(and . CONJ)

Evaluate each conjunct until one is nil and return that value, or return the value
of the last conjunct.

(and (< num_fruits 10)
     (> num_fruits 3))
(and (string-equal v "pear")
     (< num_fruits 6)
     (or (string-equal day "Tuesday")
         (string-equal day "Wednesday")))
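
The "until one is non-nil" and "until one is nil" wording means that later arguments are never evaluated once the result is known. A rough Python model of that behaviour is given below; scheme_or, scheme_and and probe are hypothetical names, each argument is wrapped in a lambda to delay its evaluation, Python's None stands in for nil, and the result for zero conjuncts is an assumption.

```python
NIL = None  # stands in for Scheme's nil

def scheme_or(*thunks):
    """Return the first non-nil value; later disjuncts are never evaluated."""
    for thunk in thunks:
        value = thunk()
        if value is not NIL:
            return value
    return NIL

def scheme_and(*thunks):
    """Return nil at the first nil conjunct, else the value of the last conjunct."""
    value = True  # assumed result when there are no conjuncts
    for thunk in thunks:
        value = thunk()
        if value is NIL:
            return NIL
    return value

evaluated = []

def probe(name, value):
    """Record that an argument was evaluated, then return its value."""
    evaluated.append(name)
    return value

result = scheme_or(lambda: probe("tortoise", NIL),
                   lambda: probe("turtle", "t"),
                   lambda: probe("never", "t"))
print(result, evaluated)
```

Only the first two disjuncts appear in the evaluated list, because the second one already produced a non-nil value.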



List functions 



(car EXPR)

Returns the "car" of EXPR; for a list this is the first item, for an atom or the empty
list this is defined to be nil.

(car '(a b)) => a
(car '((a b) c d)) => (a b)
(car '(a (b c) d)) => a
(car nil) => nil
(car 'a) => nil






(cdr EXPR) 

Returns the "cdr" of EXPR; for a list this is the rest of the list, for an atom or the
empty list this is defined to be nil.

(cdr '(a b)) => (b)
(cdr '((a b) c d)) => (c d)
(cdr '(a)) => nil
(cdr '(a (b c))) => ((b c))
(cdr nil) => nil
(cdr 'a) => nil

(cons EXPR0 EXPR1)

Build a new list whose "car" is EXPR0 and whose "cdr" is EXPR1.

(cons 'a '(b c)) => (a b c)
(cons 'a ()) => (a)
(cons '(a b) '(c d)) => ((a b) c d)
(cons () '(a)) => (nil a)
(cons 'a 'b) => (a . b)
(cons nil nil) => (nil)
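
The relationship between car, cdr and cons can be made concrete with a small model. The Python sketch below is only an illustration: it represents a cons cell as a two-element tuple and nil as None, and make_list is a helper of our own that chains cells the way the list function does. None of this reflects how Festival stores lists internally.

```python
NIL = None  # models both nil and the empty list

def cons(car_value, cdr_value):
    """Build a pair; a Python tuple stands in for a cons cell."""
    return (car_value, cdr_value)

def car(expr):
    # For an atom or the empty list the documented result is nil.
    return expr[0] if isinstance(expr, tuple) else NIL

def cdr(expr):
    return expr[1] if isinstance(expr, tuple) else NIL

def make_list(*items):
    """Chain cons cells so that the last cdr is nil, like (list ...)."""
    result = NIL
    for item in reversed(items):
        result = cons(item, result)
    return result

abc = cons("a", make_list("b", "c"))        # models (cons 'a '(b c))
print(abc == make_list("a", "b", "c"))      # same structure as (a b c)
print(car(abc), cdr(abc) == make_list("b", "c"))
print(car(NIL), cdr("a"))                   # nil for the empty list and for atoms
```

In this model (cons 'a 'b) is simply the tuple ("a", "b"), whose cdr is not a list, which is why it prints as the dotted pair (a . b) in Scheme.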

(list . BODY)

Form a list from each of the arguments

(list 'a 'b 'c) => (a b c)
(list '(a b) 'c 'd) => ((a b) c d)
(list nil '(a b) '(a b)) => (nil (a b) (a b))

(append . BODY)

Join each of the arguments (lists) into a single list

(append '(a b) '(c d)) => (a b c d)
(append '(a b) '((c d)) '(e f)) => (a b (c d) e f)
(append nil nil) => nil
(append '(a b)) => (a b)
(append 'a 'b) => error

(nth N LIST)

Return the Nth member of LIST; the first item is the 0th member.

(nth 0 '(a b c)) => a
(nth 2 '(a b c)) => c
(nth 3 '(a b c)) => nil

(nth_cdr N LIST)

Return the Nth cdr of LIST; the 0th cdr is the list itself.

(nth_cdr 0 '(a b c)) => (a b c)
(nth_cdr 2 '(a b c)) => (c)
(nth_cdr 1 '(a b c)) => (b c)
(nth_cdr 3 '(a b c)) => nil






(last LIST)

The last cdr of a list. Traditionally this function has always been called last
rather than last_cdr

(last '(a b c)) => (c)
(last '(a b (c d))) => ((c d))

(reverse LIST)

Return the list in reverse order

(reverse '(a b c)) => (c b a)
(reverse '(a)) => (a)
(reverse '(a b (c d))) => ((c d) b a)

(member ITEM LIST)

Returns the cdr in LIST whose car is ITEM, or nil if it is not found

(member 'b '(a b c)) => (b c)
(member 'c '(a b c)) => (c)
(member 'd '(a b c)) => nil
(member 'b '(a b c b)) => (b c b)

Note that member uses eq to test equality, hence this does not work for strings.
You should use member_string if the list contains strings.

(assoc ITEM ALIST)

a-lists are a standard list format for representing feature value pairs. An a-list is
basically a list of pairs of name and value; although the name may be any Lisp
item it is usually a symbol. A typical a-list is

((name AH)
 (duration 0.095)
 (vowel +)
 (occurs ("file01" "file04" "file07" "file24")))

assoc is a function that allows you to look up values in an a-list

(assoc 'name '((name AH) (duration 0.95))) => (name AH)
(assoc 'duration '((name AH) (duration 0.95))) => (duration 0.95)
(assoc 'vowel '((name AH) (duration 0.95))) => nil

Note that assoc uses eq to test equality, hence this does not work for names that
are strings. You should use assoc_string if the a-list uses strings for names.
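
Note that assoc returns the whole matching entry, not just the value. The following Python sketch models that lookup with ordinary lists; the assoc function here is our own stand-in, and because Python's == compares strings by content it behaves more like assoc_string than like the eq based assoc described above.

```python
def assoc(item, alist):
    """Return the first (name value ...) entry whose name equals item, else None."""
    for entry in alist:
        if entry and entry[0] == item:
            return entry
    return None

segment = [
    ["name", "AH"],
    ["duration", 0.095],
    ["vowel", "+"],
]

print(assoc("name", segment))
entry = assoc("duration", segment)
# The value itself is the second element of the returned entry.
print(entry[1] if entry else None)
print(assoc("stress", segment))
```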

Arithmetic functions

+ - * / exp log sqrt < > <= >= =

I/O functions

File names in Festival use the Unix convention of using "/" as the directory separator.
However under other operating systems, such as Windows, the "/" will be appropriately
mapped into backslash as required. For most cases you do not need to worry
about this and if you use forward slash all the time it will work.






(format FD FORMATSTRING . ARGS) 

The format function is a little unusual in Lisp. It basically follows the printf
command in C, or more closely follows the format function in Emacs Lisp. It is
designed to print out information that is not necessarily to be read in by Lisp (unlike
pprintf, print and printfp). FD is a file descriptor as created by fopen, and
the result is printed to that. Also two special values are allowed there: t causes the
output to be sent to standard out (which is usually the terminal), nil causes the
output to be written to a string and returned by the function. Also the variable
stderr is set to a file descriptor for standard error output.

The format string closely follows the format used in C's printf functions. It is
actually interpreted by those functions in its implementation. format supports
the following directives



%d    Print as integer
%x    Print as integer, in hexadecimal
%f    Print as float
%s    Convert item to string
%%    A percent character
%g    Print as double
%c    Print number as character
%l    Print as Lisp object

In addition directive sizes are supported, including (zero or space) padding, and
widths. Explicitly specified sizes as arguments, as in %*s, are not supported, nor
is %p for pointers.

The %s directive will try to convert the corresponding Lisp argument to a string
before passing it to the low level print function. Thus lists will be printed to
strings, and numbers also converted. This form will lose the distinction between
Lisp symbols and Lisp strings as the quotes will not be present in the %s form. In
general %s should be used for getting nice human output and not for machine
readable output as it is a lossy print form.

In contrast %l is designed to preserve the Lisp forms so they can more easily be
read; quotes will appear and escapes for embedded quotes will be treated properly.

(format t "duration %0.3f\n" 0.12345) => duration 0.123 
(format t "num %d\n" 23) => num 23 
(format t "num %04d\n" 23) => num 0023 
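
Since the directives are C printf conventions, their padding and precision behaviour can be explored outside Festival. The sketch below uses Python's % operator, which supports the same %d, %x, %f, %s, %c, %g and %% directives; the helper name c_format is our own, and the Festival specific %l directive has no Python equivalent, so it is not shown.

```python
def c_format(format_string, *args):
    """Apply printf-style directives using Python's % operator."""
    return format_string % args

print(c_format("duration %0.3f", 0.12345))      # three decimal places
print(c_format("num %d", 23))
print(c_format("num %04d", 23))                 # zero padded to width 4
print(c_format("hex %x and literal %%", 255))
print(c_format("[%5s]", "ab"))                  # space padded to width 5
```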






(pprintf SEXP [FD])

Pretty print the given expression to standard out (or FD if specified). Pretty printing
is a technique that inserts newlines and indentation in the printout to make the
Lisp expression easier to read.

(fopen FILENAME MODE)

This creates a file descriptor, which can be used in the various I/O functions. It
closely follows C's stdio fopen function. The mode may be

"r"
to open the file for reading

"w"
to open the file for writing

"a"
to open the file at the end for writing (so-called append).

"b"
File I/O in binary (for OS's that make the distinction),

or any combination of these.

(fclose FD)

Close a file descriptor as created by fopen.

(read)

Read the next s-expression from standard in

(readfp FD)

Read the next s-expression from the given file descriptor FD. On end of file it returns
an s-expression eq to the value returned by the function (eof-val). A typical
example use of these functions is

(let ((ifd (fopen infile "r"))
      (ofd (fopen outfile "w"))
      (word))
  (while (not (equal? (set! word (readfp ifd)) (eof-val)))
    (format ofd "%l\n" (lex.lookup word nil)))
  (fclose ifd)
  (fclose ofd))

(load FILENAME [NOEVAL])

Load in the s-expressions in FILENAME. If NOEVAL is unspecified the s-expressions
are evaluated as they are read. If NOEVAL is specified and non-nil, load will return
all s-expressions in the file un-evaluated in a single list.

String functions

As in many other languages, Scheme has a distinction between strings and
symbols. Strings evaluate to themselves and cannot be assigned other values;
symbols with the same print name are equal? while strings with the same contents
aren't necessarily.






In Festival's Scheme, strings are eight bit clean and designed to hold strings of text
and characters in whatever language is being synthesized. Strings are always treated
as strings of 8-bit characters even though some languages may interpret these as 16-bit
characters. Symbols, in general, should not contain 8-bit characters.



(string-equal STR1 STR2) 

Finds the string form of STR1 and STR2 and returns t if these are equal, and nil
otherwise. Symbol names and numbers are mapped to strings, though you should be
aware that the mapping of a number to a string may not always produce what
you hope for. A number 0 may or may not be mapped to "0" or maybe to "0.0"
such that you should not depend on the mapping. You can use format to
map a number to a string in an explicit manner. It is however safe to pass symbol
names to string-equal. In most cases string-equal is the right function to
use rather than equal?, which is much stricter about its definition of equality.

(string-equal "hello" "hello") => t
(string-equal "hello" "Hello") => nil
(string-equal "hello" 'hello) => t

(string-append . ARGS)

For each argument coerce it to a string, and return the concatenation of all
arguments.

(string-append "abc" "def") => "abcdef"
(string-append "/usr/local/" "bin/" "festival") => "/usr/local/bin/festival"
(string-append "/usr/local/" t 'hello) => "/usr/local/thello"
(string-append "abc") => "abc"
(string-append ) => ""

(member_string STR LIST)

Returns nil if no member of LIST is string-equal to STR, otherwise it returns t.
Again, this is often the safe way to check membership of a list as this will work
properly if STR or the members of LIST are symbols or strings.

(member_string "a" '("b" "a" "c")) => t
(member_string "d" '("b" "a" "c")) => nil
(member_string "d" '(a b c d)) => t
(member_string 'a '("b" "a" "c")) => t



(string-before STR SUBSTR) 



Returns the initial prefix of STR up to the first occurrence of SUBSTR in STR. If
SUBSTR doesn't exist within STR the empty string is returned.

(string-before "abcd" "c") => "ab"
(string-before "bin/make_labs" "/") => "bin"
(string-before "usr/local/bin/make_labs" "/") => "usr"
(string-before "make_labs" "/") => ""

(string-after STR SUBSTR) 

Returns the longest suffix of str after the first occurrence of substr in str. If 
substr doesn't exist within str the empty string is returned. 

(string-after "abcd" "c") => "d"
(string-after "bin/make_labs" "/") => "make_labs"
(string-after "usr/bin/make_labs" "/") => "bin/make_labs"
(string-after "make_labs" "/") => ""
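Since string-after only cuts at the first occurrence of SUBSTR, getting the part of a pathname after the last "/" takes repeated application. The sketch below shows one way to do that; my_basename is a hypothetical name chosen for illustration and is not claimed to be a Festival built-in.

```scheme
;; Hypothetical helper: return the part of PATH after the last "/".
;; string-matches checks whether any "/" remains; if so, everything up to
;; and including the first "/" is dropped and the test is repeated.
(define (my_basename path)
  (if (string-matches path ".*/.*")
      (my_basename (string-after path "/"))
      path))

;; Example calls
(my_basename "usr/bin/make_labs")
(my_basename "make_labs")
```

Testing with string-matches first matters here, because string-after returns the empty string when SUBSTR is absent, which would otherwise lose a name that contains no "/" at all.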



(length STR) 

Returns the length of the given string (or list). length does not coerce its
argument into a string, hence giving a symbol as an argument is an error.

(length "") => 0
(length "abc") => 3
(length 'abc) => SIOD ERROR
(length '(a b c)) => 3

(symbolexplode SYMBOL)

Returns a list of single character strings for each character in SYMBOL's print
name. This will also work on strings.

(symbolexplode 'abc) => ("a" "b" "c")
(symbolexplode 'hello) => ("h" "e" "l" "l" "o")

(intern STR) 

Convert a string into a symbol with the same print name.

(string-matches STR REGEX) 

Returns t if STR matches the REGEX regular expression. Regular expressions are
described more fully below.

(string-matches "abc" "a.*") => t
(string-matches "hello" "[Hh]ello") => t

System functions

In order to interact more easily with the underlying operating system, Festival
Scheme includes a number of basic functions that allow Scheme programs to make
use of operating system functions.

(system COMMAND) 

Evaluates the command with the Unix shell (or equivalent). It's not clear how this
should (or does) work on other operating systems, so it should be used sparingly
if the code is to be portable.

(system "ls") => lists files in current directory.
(system (format nil "cat %s" filename))



(get_url URL OFILE) 



Copies contents of URL into OFILE. It supports file: and http: prefixes, but
currently does not support the ftp: protocol.

(get_url "http://www.festvox.org/index.html" "festvox.html")



(setenv NAME VALUE) 

Set environment variable NAME to VALUE, both of which should be strings.

(setenv "DISPLAY" "nara.mt.cs.cmu.edu:0.0")



(getenv NAME)

Get value of environment variable NAME.

(getenv "DISPLAY")



(getpid) 

The process id, as a number. This is useful when creating files that need to be
unique for the Festival instance.

(set! bbbfile (format nil "/tmp/stuff.%05d" (getpid)))
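The functions in this section combine naturally when a script needs a scratch file. The following is a sketch only, assuming a Unix-like system with a writable /tmp; tmp_name and labfile are hypothetical names used for illustration.

```scheme
;; Hypothetical helper: build a per-process temporary file name from a
;; prefix and the process id, in the same style as the getpid example.
(define (tmp_name prefix)
  (format nil "/tmp/%s.%05d" prefix (getpid)))

;; Use the name with shell commands, then remove the file afterwards.
(set! labfile (tmp_name "labs"))
(system (format nil "ls /tmp > %s" labfile))
(system (format nil "rm -f %s" labfile))
```

Because the commands go through system, this inherits the portability caveat given for that function.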

(cd DIRECTORY) 

Change directory. 

(cd "/tmp")

(pwd)

Returns a string which is the pathname of the current working directory.

Utterance Functions

%%%%% Utterance construction and access functions

Synthesis Functions

%%%%% Synthesis specific functions

Debugging and Help

%%%%% backtrace, debugging, advise etc.

Adding new C++ functions to Scheme

Brief description of C++ interface.






Regular Expressions 



Regular expressions are fundamentally useful in any text processing language. This is
also true in Festival's Scheme. The function string-matches and a number of other
places (notably CART trees) allow the use of regular expressions to match strings.

We will not go into the formal aspects of regular expressions but just give enough
discussion to help you use them here. See [regexbook] for probably more information
than you'll ever need.

Each implementation of regexes may be slightly different, hence here we will lay out
the full syntax and semantics of our regex patterns. This is not an arbitrary
selection: when Festival was first developed we used the GNU libg++ Regex class, but
for portability to non-GNU systems we had to replace that with our own implementation
based on Henry Spencer's regex code (which is at the core of many regex libraries).

In general all characters match themselves except for the following, which (can) have
special interpretation:

. * + ? [ ] - ( ) | ^ $ \

If these are preceded by a backslash then they no longer have special
interpretation.

.

Matches any character.

(string-matches "abc" "a.c") => t
(string-matches "acc" "a.c") => t

*

Matches zero or more occurrences of the preceding item in the regex

(string-matches "aaaac" "a*c") => t
(string-matches "c" "a*c") => t
(string-matches "anythingc" ".*c") => t
(string-matches "canythingatallc" "c.*c") => t

+

Matches one or more occurrences of the preceding item in the regex

(string-matches "aaaac" "a+c") => t
(string-matches "c" "a+c") => nil
(string-matches "anythingc" ".+c") => t
(string-matches "c" ".+c") => nil
(string-matches "canythingatallc" "c.+c") => t
(string-matches "cc" "c.+c") => nil



?

Matches zero or one occurrences of the preceding item. That is, it makes the
preceding item optional.

(string-matches "abc" "ab?c") => t
(string-matches "ac" "ab?c") => t

[ ]

Can define a set of characters. This can also be used to define a range. For
example [aeiou] is any lower case vowel, [a-z] is any lower case letter from a
through z. [a-zA-Z] is any letter, upper or lower case.

If the ^ is specified first it negates the class, thus [^a-z] matches anything but a
lower case letter.

\( \)

Allow sections to be formed to allow other operators to affect them. For example
the * applies to the previous item, thus to match zero or more occurrences of
something longer than a single character

(string-matches "helloworld" "hello\\(there\\)*world") => t
(string-matches "hellothereworld" "hello\\(there\\)*world") => t
(string-matches "hellotherethereworld" "hello\\(there\\)*world") => t

Note that you need two backslashes, one to escape the other backslash.

\|

Or operator. Allows choice of two alternatives

(string-matches "hellofishworld" "hello\\(fish\\|chips\\)world") => t
(string-matches "hellochipsworld" "hello\\(fish\\|chips\\)world") => t

Note that you need two backslashes, one to escape the other backslash.
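The operators above can be combined in a single pattern. As a minimal sketch, the hypothetical predicate below (looks_like_number is not a Festival function) accepts strings of digits with an optional decimal part, using a character class, +, an escaped dot, grouping and ?.

```scheme
;; Hypothetical predicate: t if TOK is digits, optionally followed by a
;; decimal point and more digits. string-matches must match the whole
;; string, so no explicit anchors are used. Inside the Scheme string each
;; regex backslash is written twice.
(define (looks_like_number tok)
  (string-matches tok "[0-9]+\\(\\.[0-9]+\\)?"))

;; Example calls
(looks_like_number "42")
(looks_like_number "3.14")
(looks_like_number "abc")
```

The dot is escaped so that it matches only a literal ".", and the group is made optional as a unit by the trailing ?.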

Some Examples

%%%%% some typical example code usage.



Chapter 26. Edinburgh Speech Tools 



Details of wagon, ch_wave, ngram etc stuff. 
Edinburgh Speech Tools 1 

Notes 

1. http://festvox.org/docs/speech_tools-1.2.0/book1.htm



Chapter 27. Machine Learning 

Decision trees, OLS, SVD, and pointers to Tom's book.



Chapter 28. Resources 



In this chapter we will try to list some of the important resources available that you
may need when building a voice in Festival. This list cannot be complete and
comprehensive, but we will try to give references to meta-resources as well as direct
references to information, code and data that may be of use to you.

This document itself will be updated occasionally and it is worth checking to ensure
that you have the latest copy.

Updates, new databases, new language support etc will happen intermittently, and new
voices will be released which may help you develop your own new voices.

http://festvox.org

has been set up as a resource center for voices in Festival offering databases,
examples and a repository for voice distribution. Checking that site regularly is a
good thing to do.

Specifically

http://festvox.org/examples/cmu_us_kal_diphone/

Offers a complete example US English diphone database as built using the
walkthrough in Chapter 20. The originally recorded diphone database is also
available as is, at http://festvox.org/databases/cmu_us_kal_diphone/.

http://festvox.org/examples/cmu_time_awb_ldom/

Offers a complete example limited domain synthesis database as built using the
walkthroughs in Chapter 5.

Other databases, lexicons etc will be installed on festvox.org as they become
available.

There is also a mailing list festvox-talk@festvox.org for discussing aspects of
building voices. See http://festvox.org/maillist.html for details of joining it and the
archive of messages already sent. Also, while traffic is low, feel free to mail the
authors awb@cs.cmu.edu or lenzo@cs.cmu.edu and we will try to help where we can.

Festival resources

The Festival home page is http://www.cstr.ed.ac.uk/projects/festival/. It is updated
regularly as new developments happen.

The Festival Speech Synthesis System code and the Edinburgh Speech Tools library
and related programs are available from

ftp://ftp.cstr.ed.ac.uk/pub/festival/
or in the US at
http://www.festvox.org/festival/downloads.html

Note that precompiled versions of the system are also available from that site,
though at time of writing only Linux binaries are available.

Festival comes with its own manual in html, postscript and GNU info format. It and
a less comprehensive Speech Tools manual are pre-built in festdoc-1.4.1.tar.gz.
The manuals are also available on line at

http://www.festvox.org/docs/manual-1.4.2/festival_toc.html
http://www.festvox.org/docs/speech_tools-1.2.0/book1.htm



You will likely need to reference these manuals often. 

It will also be useful to have access to other voices developed in Festival, as seeing
how others solve problems may make things clearer.

In addition to Festival itself a number of other projects throughout the world use
Festival and have also released resources. The "Related Projects" links give URLs to
other organizations which you may find useful.

It is worth mentioning Oregon Graduate Institute here, who have done a lot of work
with the system and released other voices for it (US English and Mexican Spanish). See
http://cslu.cse.ogi.edu/tts/ for more details.

A second project worth mentioning is the MBROLA project [dutoit96]
http://tcts.fpms.ac.be/synthesis/mbrola.html. They offer a waveform synthesis
technique [dutoit93] and a number of diphone databases for lots of different
languages. MBROLA itself doesn't offer a front end, just phone, duration and F0
target to waveform synthesis. (However they do offer a full French TTS system too.)
Their diphone databases complement Festival well, and a number of projects use
MBROLA databases for their waveform synthesis and Festival as the front end. If
you lack resources to record and build diphone databases this is a good place to
check for existing diphone databases for languages. Most of their databases have
some use/distribution restrictions but they usually allow any non-commercial use.

General speech resources 

The network is a vast resource of information but it is not always easy to find what
you are looking for.

Indexes to speech related information are available. The comp.speech frequently
asked questions, maintained by Andrew Hunt, is an excellent constantly updated list of
information and resources available for speech recognition and synthesis. It is
available in html format from

Australia: http://www.speech.su.oz.au/comp.speech/
UK: http://svr-www.eng.cam.ac.uk/comp.speech/
Japan: http://www.itl.atr.co.jp/comp.speech/
USA: http://www.speech.cs.cmu.edu/comp.speech/

The Linguistic Data Consortium (LDC), although expensive, offers many speech
resources including lexicons and databases suitable for synthesis work. Their web
page is http://www.ldc.upenn.edu. A similar organization is the European Language
Resources Association http://www.icp.grenet.fr/ELRA/home.html, which is based
in Europe. Both these home pages have links to other potential resources.

%%%%% to be added: recording and EGG information



Notes 



1. http://festvox.org
2. http://festvox.org/examples/cmu_us_kal_diphone/
3. http://festvox.org/databases/cmu_us_kal_diphone/
4. http://festvox.org/examples/cmu_time_awb_ldom/
5. http://festvox.org/maillist.html
6. http://www.cstr.ed.ac.uk/projects/festival/
7. ftp://ftp.cstr.ed.ac.uk/pub/festival/
8. http://www.festvox.org/festival/downloads.html
9. http://www.festvox.org/docs/manual-1.4.2/festival_toc.html
10. http://www.festvox.org/docs/speech_tools-1.2.0/book1.html
11. http://cslu.cse.ogi.edu/tts/
12. http://tcts.fpms.ac.be/synthesis/mbrola.html
13. http://www.speech.su.oz.au/comp.speech/
14. http://svr-www.eng.cam.ac.uk/comp.speech/
15. http://www.itl.atr.co.jp/comp.speech/
16. http://www.speech.cs.cmu.edu/comp.speech/
17. http://www.ldc.upenn.edu
18. http://www.icp.grenet.fr/ELRA/home.html



Chapter 29. Tools Installation 

This chapter describes the installation method for Festival, the Festvox voice
building tools and related systems as used throughout this book. Binary distributions
are included for standard systems, but full source is also provided, as are
instructions on installation.

These tools are included on the CD as distributed with the book, though all parts
(and possibly later releases of them) are also available from http://festvox.org.






Chapter 30. English phone lists 



Relating phonemes to sounds is not as obvious as people think. Even when one is
familiar with phone sets it's easy to make mistakes when reading lists of phones
alone. This is particularly true in reading diphone nonsense words. The tables
provided here are intended for both the experienced and inexperienced reader of
phones, to help you decide on the pronunciation.

These tables are not supposed to be a substitute for a good phonetics course; they are
intended to give people a basic idea of the pronunciation of the phone sets used in
the particular examples in this document. Many simplifying assumptions have been
made, and often aren't even mentioned. To the phoneticians out there I apologise: as
much as the assumptions are wrong, we are here listing atomic discrete phones which
we have found useful in building practical systems, even though better sets probably
exist.

In spite of everyone telling you that there is one and only one US phoneset, when it
comes to actually using one you quickly discover there are actually many standard
ones used by lots of different pieces of software. Often the difference between them
is trivial (e.g. case folding) but computers, being fundamentally dumb, can't take
these trivial differences into account. Here we list the radio phoneset which is used
by standard US voices in Festival. The definition is in festival/lib/radio_phones.scm.
This list was based on those phones that appear in the Boston University FM radio
corpus with minor modifications. The list here is exactly those phones which appear
in the diphone nonsense words as used in the example explained in Chapter 20.



US phoneset 




ae      fAt, bAd
aa      fAther, wAshington
ah      bUt, hUsh
ao      lAWn, dOOr, mAll
aw      hOW, sOUth, brOWser
ax      About, cAnoe
ay      hIde, bIble
eh      gEt, fEAther
el      tabLE, usabLE
em      systEM, communisM
en      beatEN
er      fERtile, sEARch, makER
ey      gAte, Ate
ih      bIt, shIp
iy      bEAt, shEEp
ow      lOne, nOse
oy      tOY, OYster
uh      fUll, wOOd
uw      fOOl, fOOd
b       Book, aBrupt
ch      CHart, larCH
d       Done, baD
dh      THat, faTHer
f       Fat, lauGH
g       Good, biGGer
hh      Hello, loopHole
jh      diGit, Jack
k       Camera, jaCK, Kill
l       Late, fuLL
m       Man, gaMe
n       maN, New
ng      baNG, sittiNG
p       Pat, camPer
r       Reason, caR
s       Sit, maSS
sh      SHip, claSH
t       Tap, baT
th      THeatre, baTH
v       Various, haVe
w       Water, cobWeb
y       Yellow, Yacht
z       Zero, quiZ, boyS
zh      viSion, caSual
pau     short silence

In addition to the phones themselves, the nonsense words generated by the diphone
schema also have some other notations to denote different types of phone.

The use of - (hyphen) in the nonsense word itself is used to denote an explicit
syllable boundary. Thus pau t aa n - k aa pau is used to state that the word should be
pronounced as tan ka rather than tank ah. Where no explicit syllable boundary
is given the word should be pronounced naturally without any boundary
(which is probably too underspecified in some cases).

The use of _ (underscore) in phone names is used to denote consonant clusters. That
is, t_-_r is the /tr/ as found in trip, not that in cat run.



UK phoneset 

This phoneset, developed at CSTR a number of years ago, is for Southern UK English
(RP, "received pronunciation"). Its definition is in festival/lib/mrpa_phones.scm.

uh      cUp, dOne
e       bEt, chEck
a       cAt, mAtch
o       cOttage, hOt
i       bIt, shIp
u       pUll, fOOt, bOOk
ii      bEAt, shEEp
uu      pOOl, bOOt
oo      AUthor, cOURt
aa      ARt, hEARt
@@      sEARch, bURn
ai      bIte, mIght, lIke
ei      Ate, mAIl
oi      tOY, OYster
au      sOUth, hOW
ou      hOle, cOAt
e@      AIR, bARe, chAIR
i@      EAR, bEER
u@      sUre, jUry
@       About, atlAs, equipmEnt
p       Pat, camPer
t       Tap, baT
k       Camera, jaCK, Kill
b       Book, aBrupt
d       Done, baD
g       Good, biGGer
s       Sit, maSS
z       Zero, quiZ, boyS
sh      SHip, claSH
zh      viSion, caSual
f       Fat, lauGH
v       Various, haVe
th      THeatre, baTH
jh      diGit, Jack
h       Hello, loopHole
m       Man, gaMe
n       maN, New
ng      baNG, sittiNG
l       Late, bLack
y       Yellow, Yacht
r       Reason, caReer
w       Water, cobWeb
#       short silence

In addition to the phones themselves, the nonsense words generated by the diphone
schema also have some other notations to denote different types of phone.

The use of - (hyphen) in the nonsense word itself is used to denote an explicit
syllable boundary. Thus pau t aa n - k aa pau is used to state that the word should be
pronounced as tan ka rather than tank ah. Where no explicit syllable boundary
is given the word should be pronounced naturally without any boundary
(which is probably too underspecified in some cases).

The use of _ (underscore) in phone names is used to denote consonant clusters. That
is, t_-_r is the /tr/ as found in trip, not that in cat run.



